{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a480f6f3",
   "metadata": {},
   "source": [
    "# RAG Ecosystem\n",
    "\n",
    "[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/release/python-3100/) [![LangChain](https://img.shields.io/badge/LangChain-%23007ACC.svg?logo=LangChain)](https://www.langchain.com/) [![DeepEval](https://img.shields.io/badge/DeepEval-Evaluation-orange)](https://github.com/confident-ai/deepeval) [![RAGAS](https://img.shields.io/badge/RAGAS-Evaluation-blueviolet)](https://github.com/explodinggradients/ragas) [![OpenAI](https://img.shields.io/badge/OpenAI-API-lightgrey)](https://openai.com/) [![Cohere](https://img.shields.io/badge/Cohere-API-yellowgreen)](https://cohere.com/) [![Medium](https://img.shields.io/badge/Medium-Blog-black?logo=medium)](https://medium.com/@fareedkhandev/8f23349b96a4)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1f2a958b",
   "metadata": {},
   "source": [
    "Building a complete RAG-based AI system depends on many components, each of which requires its own optimization and careful implementation. These components include:\n",
    "\n",
    "![Production Ready RAG System (Created by Fareed Khan)](https://miro.medium.com/v2/resize:fit:2400/1*ZjozYulECfqrzgMaTEZ-Rg.png)\n",
    "\n",
    "- **Query Transformations:** Rewriting user questions to be more effective for retrieval.\n",
    "- **Intelligent Routing:** Directing a query to the correct data source or a specialized tool.\n",
    "- **Indexing:** Creating a multi-layered knowledge base.\n",
    "- **Retrieval and Re-ranking:** Filtering noise and prioritizing the most relevant context.\n",
    "- **Self-Correcting Agentic Flows:** Building systems that can grade and improve their own work.\n",
    "- **End-to-End Evaluation:** Objectively measuring the performance of the entire pipeline.\n",
    "\n",
    "and much more …\n",
    "\n",
    "> We will learn and code each part of the RAG ecosystem along with visuals for easier understanding, starting from the basics to advanced techniques.\n",
    "\n",
    "All the code (Theory + Notebook) is available in my GitHub Repo:\n",
    "\n",
    "[[[[[[[[[[[[[[ LINK ]]]]]]]]]]]]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "893b98a0",
   "metadata": {},
   "source": [
    "## Table of Contents\n",
    "\n",
    "- [Understanding Basic RAG System](#part1)\n",
    "  - [Indexing Phase](#part1-1)\n",
    "  - [Retrieval](#part1-2)\n",
    "  - [Generation](#part1-3)\n",
    "- [Advanced Query Transformations](#part2)\n",
    "  - [Multi-Query Generation](#part2-1)\n",
    "  - [RAG-Fusion](#part2-2)\n",
    "  - [Decomposition](#part2-3)\n",
    "  - [Step-Back Prompting](#part2-4)\n",
    "  - [HyDE](#part2-5)\n",
    "- [Routing & Query Construction](#part3)\n",
    "  - [Logical Routing](#part3-1)\n",
    "  - [Semantic Routing](#part3-2)\n",
    "  - [Query Structuring](#part3-3)\n",
    "- [Advanced Indexing Strategies](#part4)\n",
    "  - [Multi-Representation Indexing](#part4-1)\n",
    "  - [Hierarchical Indexing (RAPTOR) Knowledge Tree](#part4-2)\n",
    "  - [Token-Level Precision (ColBERT)](#part4-3)\n",
    "- [Advanced Retrieval & Generation](#part5)\n",
    "  - [Dedicated Re-ranking](#part5-1)\n",
    "  - [Self-Correction using AI Agents](#part5-2)\n",
    "  - [Impact of Long Context](#part5-3)\n",
    "- [Manual RAG Evaluation](#part6)\n",
    "  - [The Core Metrics: What Should We Measure?](#part6-1)\n",
    "  - [Building Evaluators from Scratch with LangChain](#part6-2)\n",
    "- [Evaluation with Frameworks](#part7)\n",
    "  - [Rapid Evaluation with deepeval](#part7-1)\n",
    "  - [Another Powerful Alternative with grouse](#part7-2)\n",
    "  - [Evaluation with RAGAS](#part7-3)\n",
    "- [Summarizing Everything](#part8)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b5610337",
   "metadata": {},
   "source": [
    "<a id='part1'></a>\n",
    "# Understanding Basic RAG System\n",
    "\n",
    "Before we look into the basics of RAG, let’s install core Python libraries commonly used for AI products, such as LangChain and others."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "83844704",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: langchain in /path/to/your/env/lib/python3.9/site-packages (0.1.0)\n",
      "Requirement already satisfied: langchain_community in /path/to/your/env/lib/python3.9/site-packages (0.1.0)\n",
      "Requirement already satisfied: langchain-openai in /path/to/your/env/lib/python3.9/site-packages (0.1.0)\n",
      "Requirement already satisfied: langchainhub in /path/to/your/env/lib/python3.9/site-packages (0.1.0)\n",
      "Requirement already satisfied: chromadb in /path/to/your/env/lib/python3.9/site-packages (0.4.22)\n",
      "Requirement already satisfied: tiktoken in /path/to/your/env/lib/python3.9/site-packages (0.5.2)\n",
      "Successfully installed langchain langchain_community langchain-openai langchainhub chromadb tiktoken\n"
     ]
    }
   ],
   "source": [
    "# Installing Required Modules\n",
    "!pip install langchain langchain_community langchain-openai langchainhub chromadb tiktoken"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0517d6a4",
   "metadata": {},
   "source": [
    "We can now set the environment variables for LangSmith tracing and for the LLM API provider we will be using."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "b98d5f60",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# Enable LangSmith tracing and set the LangChain API endpoint and API key\n",
    "os.environ['LANGCHAIN_TRACING_V2'] = 'true'\n",
    "os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'\n",
    "os.environ['LANGCHAIN_API_KEY'] = '<your-api-key>'  # Replace with your LangChain API key\n",
    "\n",
    "# Set OpenAI API key for using OpenAI models\n",
    "os.environ['OPENAI_API_KEY'] = '<your-api-key>'  # Replace with your OpenAI API key"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "091c051a",
   "metadata": {},
   "source": [
    "You can obtain a `LangSmith` API key from [their official documentation](https://www.langchain.com/langsmith) to trace our RAG pipeline throughout this blog. For the LLM, we will use the `OpenAI` API, but `LangChain` supports a variety of other LLM providers as well.\n",
    "\n",
    "The core RAG pipeline is the foundation of any advanced system, and understanding its components is important. Therefore, before going into the details of advanced components, we first need to understand the core logic of how a RAG system works. **You can skip this section if you already know how a RAG system works.**\n",
    "\n",
    "![Basic RAG system (Created by Fareed Khan)](https://miro.medium.com/v2/resize:fit:1250/1*c_yxo0cUH8u7o5an-Tzi0g.png)\n",
    "\n",
    "The simplest RAG system can be broken into three components:\n",
    "\n",
    "- **Indexing**: Organize and store data in a structured format to enable efficient searching.\n",
    "- **Retrieval**: Search and fetch relevant data based on a query or input.\n",
    "- **Generation**: Create a final response or output using the retrieved data.\n",
    "\n",
    "Let’s build this simple pipeline from the ground up to see how each piece works."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c100be33",
   "metadata": {},
   "source": [
    "<a id='part1-1'></a>\n",
    "## Indexing Phase\n",
    "\n",
    "Before our RAG system can answer any questions, it needs knowledge to draw from. For this, we’ll use a `WebBaseLoader` to pull content directly from [Lilian Weng's excellent blog post](https://lilianweng.github.io/posts/2023-06-23-agent/) on LLM-powered agents.\n",
    "\n",
    "![Indexing phase (Created by Fareed Khan)](https://miro.medium.com/v2/resize:fit:875/1*dnSg_QmGd4J030_bznvUPw.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "7fbbf0b6",
   "metadata": {},
   "outputs": [],
   "source": [
    "import bs4\n",
    "from langchain_community.document_loaders import WebBaseLoader\n",
    "\n",
    "# Initialize a web document loader with specific parsing instructions\n",
    "loader = WebBaseLoader(\n",
    "    web_paths=(\"https://lilianweng.github.io/posts/2023-06-23-agent/\",),  # URL of the blog post to load\n",
    "    bs_kwargs=dict(\n",
    "        parse_only=bs4.SoupStrainer(\n",
    "            class_=(\"post-content\", \"post-title\", \"post-header\")  # Only parse specified HTML classes\n",
    "        )\n",
    "    ),\n",
    ")\n",
    "\n",
    "# Load the filtered content from the web page into documents\n",
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f15a641e",
   "metadata": {},
   "source": [
    "The `bs_kwargs` argument restricts parsing to the relevant HTML classes (`post-content`, `post-title`, and `post-header`), cleaning up our data from the start.\n",
    "\n",
    "Now that we have the document, we face our first challenge. Feeding a massive document directly into an LLM is inefficient and often impossible due to context window limits.\n",
    "\n",
    "> This is why **chunking** is a critical step. We need to break the document into smaller, semantically meaningful pieces.\n",
    "\n",
    "The `RecursiveCharacterTextSplitter` is a good default for this job: it tries a prioritized list of separators (paragraph breaks first, then line breaks, then spaces), so paragraphs and sentences stay intact where possible."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "ac6bab57",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "\n",
    "# Create a text splitter to divide text into chunks of 1000 characters with 200-character overlap\n",
    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)\n",
    "\n",
    "# Split the loaded documents into smaller chunks\n",
    "splits = text_splitter.split_documents(docs)"
   ]
  },
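  {
   "cell_type": "markdown",
   "id": "a1d0c001",
   "metadata": {},
   "source": [
    "To build intuition for `chunk_size` and `chunk_overlap`, here is a minimal sketch of fixed-size chunking with overlap. It is an illustration only, not how `RecursiveCharacterTextSplitter` is implemented (the real splitter also looks for natural separators). The helper name `sliding_window_chunks` and the sample string are made up for this example. Each printed chunk should begin with the last three characters of the previous one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1d0c002",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch only: a fixed-size sliding window, NOT the actual\n",
    "# RecursiveCharacterTextSplitter algorithm (which also respects separators).\n",
    "def sliding_window_chunks(text, chunk_size, chunk_overlap):\n",
    "    # The window advances by (chunk_size - chunk_overlap) characters each step\n",
    "    if chunk_overlap >= chunk_size:\n",
    "        raise ValueError('chunk_overlap must be smaller than chunk_size')\n",
    "    step = chunk_size - chunk_overlap\n",
    "    chunks = []\n",
    "    for start in range(0, len(text), step):\n",
    "        chunks.append(text[start:start + chunk_size])\n",
    "        # Stop once the current window has reached the end of the text\n",
    "        if start + chunk_size >= len(text):\n",
    "            break\n",
    "    return chunks\n",
    "\n",
    "# Hypothetical sample input, small enough to inspect by eye\n",
    "sample = 'abcdefghijklmnopqrstuvwxyz'\n",
    "for chunk in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3):\n",
    "    print(chunk)"
   ]
  },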
  {
   "cell_type": "markdown",
   "id": "f7a32468",
   "metadata": {},
   "source": [
    "With `chunk_size=1000`, each chunk holds at most 1000 characters, and `chunk_overlap=200` lets neighboring chunks share up to 200 characters, which helps preserve context across chunk boundaries.\n",
    "\n",
    "Our text is now split, but it’s still just text. To perform similarity searches, we need to convert these chunks into numerical representations called **embeddings**. We will then store these embeddings in a **vector store**, which is a specialized database designed for efficient searching of vectors.\n",
    "\n",
    "The `Chroma` vector store and `OpenAIEmbeddings` make this incredibly simple. The following line handles both embedding and indexing in one go."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "84e824e6",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.vectorstores import Chroma\n",
    "from langchain_openai import OpenAIEmbeddings\n",
    "\n",
    "# Embed the text chunks and store them in a Chroma vector store for similarity search\n",
    "vectorstore = Chroma.from_documents(\n",
    "    documents=splits, \n",
    "    embedding=OpenAIEmbeddings()  # Use OpenAI's embedding model to convert text into vectors\n",
    ")"
   ]
  },
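  {
   "cell_type": "markdown",
   "id": "a1d0c003",
   "metadata": {},
   "source": [
    "To see what a similarity search over embeddings boils down to, here is a toy sketch in plain Python. The three-dimensional vectors and chunk labels are invented for illustration (real embedding models return hundreds or thousands of dimensions), and cosine similarity is only one common choice of metric; the metric a given vector store uses by default may differ."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1d0c004",
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "def cosine_similarity(a, b):\n",
    "    # Dot product divided by the product of the two vector lengths\n",
    "    dot = sum(x * y for x, y in zip(a, b))\n",
    "    norm_a = math.sqrt(sum(x * x for x in a))\n",
    "    norm_b = math.sqrt(sum(y * y for y in b))\n",
    "    return dot / (norm_a * norm_b)\n",
    "\n",
    "# Hypothetical toy 'embeddings' keyed by a label for each chunk\n",
    "toy_index = {\n",
    "    'chunk about task decomposition': [0.9, 0.1, 0.0],\n",
    "    'chunk about memory': [0.1, 0.9, 0.1],\n",
    "    'chunk about tool use': [0.0, 0.2, 0.9],\n",
    "}\n",
    "\n",
    "# Hypothetical embedding of the user query\n",
    "query_vector = [0.8, 0.2, 0.1]\n",
    "\n",
    "# Rank chunks from most to least similar to the query\n",
    "ranked = sorted(\n",
    "    toy_index.items(),\n",
    "    key=lambda item: cosine_similarity(query_vector, item[1]),\n",
    "    reverse=True,\n",
    ")\n",
    "for text, vector in ranked:\n",
    "    print(f'{cosine_similarity(query_vector, vector):.3f}  {text}')"
   ]
  },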
  {
   "cell_type": "markdown",
   "id": "eb2192e6",
   "metadata": {},
   "source": [
    "With our knowledge indexed, we are now ready to start asking questions."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2fc733ac",
   "metadata": {},
   "source": [
    "<a id='part1-2'></a>\n",
    "## Retrieval\n",
    "\n",
    "The vector store is our library, and the **retriever** is our smart librarian. It takes a user’s query, embeds it, and then fetches the most semantically similar chunks from the vector store.\n",
    "\n",
    "![Retrieval Phase (Created by Fareed Khan)](https://miro.medium.com/v2/resize:fit:1250/1*jtf1FoBGfpnDPTTu9N94Wg.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "d4b59828",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a retriever from the vector store\n",
    "retriever = vectorstore.as_retriever()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ef3627ad",
   "metadata": {},
   "source": [
    "Let’s test it. We’ll ask a question and see what our retriever finds."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "d9fb8243",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Task decomposition can be done (1) by LLM with simple prompting like \"Steps for XYZ.\", \"What are the subgoals for achieving XYZ?\", (2) by using task-specific instructions; e.g. \"Write a story outline.\" for writing a novel, or (3) with human inputs.\n",
      "\n",
      "Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote."
     ]
    }
   ],
   "source": [
    "# Retrieve relevant documents for a query (retrievers are Runnables, so we call invoke)\n",
    "docs = retriever.invoke(\"What is Task Decomposition?\")\n",
    "\n",
    "# Print the content of the first retrieved document\n",
    "print(docs[0].page_content)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1e3272bc",
   "metadata": {},
   "source": [
    "As you can see, the retriever successfully pulled the most relevant chunk from the blog post that directly discusses “Task decomposition.” This piece of context is exactly what the LLM needs to form an accurate answer."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ed42d28c",
   "metadata": {},
   "source": [
    "<a id='part1-3'></a>\n",
    "## Generation\n",
    "\n",
    "We have our context, but we need an LLM to read it and formulate a human-friendly answer. This is the **“Generation”** step in RAG.\n",
    "\n",
    "![Generation Step (Created by Fareed Khan)](https://miro.medium.com/v2/resize:fit:1250/1*0K6ognTAEOJQmb6KDL9wBw.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "edadc16b",
   "metadata": {},
   "source": [
    "First, we need a good prompt template. This instructs the LLM on how to behave. Instead of writing our own, we can pull a pre-optimized one from LangChain Hub."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "e5f72b5a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "input_variables=['context', 'question'] output_parser=StrOutputParser() partial_variables={} template='You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don\\'t know the answer, just say that you don\\'t know. Use three sentences maximum and keep the answer concise.\\nQuestion: {question} \\nContext: {context} \\nAnswer:' template_format='f-string' validate_template=True"
     ]
    }
   ],
   "source": [
    "from langchain import hub\n",
    "\n",
    "# Pull a pre-made RAG prompt from LangChain Hub\n",
    "prompt = hub.pull(\"rlm/rag-prompt\")\n",
    "\n",
    "# printing the prompt\n",
    "print(prompt)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e2dbe226",
   "metadata": {},
   "source": [
    "Next, we initialize our LLM. We’ll use `gpt-3.5-turbo`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "227f0716",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_openai import ChatOpenAI\n",
    "\n",
    "# Initialize the LLM\n",
    "llm = ChatOpenAI(model_name=\"gpt-3.5-turbo\", temperature=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ded9935",
   "metadata": {},
   "source": [
    "Now for the final step: chaining everything together. Using the LangChain Expression Language (LCEL), we can pipe the output of one component into the input of the next."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "dbeb053f",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_core.output_parsers import StrOutputParser\n",
    "from langchain_core.runnables import RunnablePassthrough\n",
    "\n",
    "# Helper function to format retrieved documents\n",
    "def format_docs(docs):\n",
    "    return \"\\n\\n\".join(doc.page_content for doc in docs)\n",
    "\n",
    "# Define the full RAG chain\n",
    "rag_chain = (\n",
    "    {\"context\": retriever | format_docs, \"question\": RunnablePassthrough()}\n",
    "    | prompt\n",
    "    | llm\n",
    "    | StrOutputParser()\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "322d9050",
   "metadata": {},
   "source": [
    "Let’s break down this chain:\n",
    "\n",
    "1. `{\"context\": retriever | format_docs, \"question\": RunnablePassthrough()}`: This part runs in parallel. It sends the user's question to the `retriever` to get documents, which are then formatted into a single string by `format_docs`. Simultaneously, `RunnablePassthrough` passes the original question through unchanged.\n",
    "2. `| prompt`: The context and question are fed into our prompt template.\n",
    "3. `| llm`: The formatted prompt is sent to the LLM.\n",
    "4. `| StrOutputParser()`: This cleans up the LLM's output into a simple string.\n",
    "\n",
    "Now, let’s invoke the entire chain."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "511eb59f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Task decomposition is a technique used to break down large tasks into smaller, more manageable subgoals. This can be achieved by using a Large Language Model (LLM) with simple prompts, task-specific instructions, or human inputs. For example, Tree of Thoughts is a method that extends Chain of Thought by exploring multiple reasoning possibilities at each step, decomposing the problem into multiple thought steps and generating multiple thoughts per step in a tree structure."
     ]
    }
   ],
   "source": [
    "# Ask a question using the RAG chain\n",
    "response = rag_chain.invoke(\"What is Task Decomposition?\")\n",
    "print(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2e71edc1",
   "metadata": {},
   "source": [
    "And there we have it: our RAG pipeline successfully retrieved relevant information about **“Task Decomposition”** and used it to generate a concise, accurate answer. This simple chain forms the foundation upon which we will build more advanced and powerful capabilities."
   ]
  },
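  {
   "cell_type": "markdown",
   "id": "a1d0c005",
   "metadata": {},
   "source": [
    "The parallel dictionary step is the least obvious part of the chain, so it can help to run it in isolation without any API calls. The sketch below swaps the retriever and the prompt for hypothetical stand-ins built with `RunnableLambda` (`fake_retriever`, `join_texts`, and the final formatting lambda are invented for this example); it only demonstrates that each dictionary value receives the same input."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1d0c006",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_core.runnables import RunnableLambda, RunnablePassthrough\n",
    "\n",
    "# Hypothetical stand-in for a retriever: returns hard-coded strings instead of Documents\n",
    "fake_retriever = RunnableLambda(lambda q: [f'first note about: {q}', 'second, unrelated note'])\n",
    "\n",
    "# Hypothetical stand-in for format_docs: joins the retrieved strings\n",
    "join_texts = RunnableLambda(lambda texts: ' | '.join(texts))\n",
    "\n",
    "# The dict is coerced into a parallel step: every value receives the same input\n",
    "toy_chain = (\n",
    "    {'context': fake_retriever | join_texts, 'question': RunnablePassthrough()}\n",
    "    | RunnableLambda(lambda d: f\"question={d['question']!r}, context={d['context']!r}\")\n",
    ")\n",
    "\n",
    "print(toy_chain.invoke('What is RAG?'))"
   ]
  },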
  {
   "cell_type": "markdown",
   "id": "aa1833de",
   "metadata": {},
   "source": [
    "<a id='part2'></a>\n",
    "# Advanced Query Transformations\n",
    "\n",
    "We now understand the fundamentals of a RAG pipeline, but production systems often reveal the limitations of this basic approach. One of the most common failure points is the user’s query itself.\n",
    "\n",
    "![Query Transformation (Created by Fareed Khan)](https://miro.medium.com/v2/resize:fit:1250/1*FO2U9QA49kjn6OaBGZuq8A.png)\n",
    "\n",
    "> A query might be too specific, too broad, or use different vocabulary than our source documents, leading to poor retrieval results.\n",
    "\n",
    "The solution isn’t to blame the user; it’s to make our system smarter. **Query Transformation** is a set of techniques designed to rewrite, expand, or break down the original question to improve retrieval accuracy.\n",
    "\n",
    "Instead of relying on a single query, we’ll engineer multiple, better-informed queries to cast a wider and more accurate net.\n",
    "\n",
    "To test these techniques, we will use the same blog post as in the basic RAG section, this time split into smaller token-based chunks (300 tokens with a 50-token overlap). Keeping the source fixed lets us compare results across techniques.\n",
    "\n",
    "As a quick refresher, here’s how we set up our retriever:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "3c2de569",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the blog post\n",
    "loader = WebBaseLoader(\n",
    "    web_paths=(\"https://lilianweng.github.io/posts/2023-06-23-agent/\",),\n",
    "    bs_kwargs=dict(\n",
    "        parse_only=bs4.SoupStrainer(\n",
    "            class_=(\"post-content\", \"post-title\", \"post-header\")\n",
    "        )\n",
    "    ),\n",
    ")\n",
    "blog_docs = loader.load()\n",
    "\n",
    "# Split the documents into chunks\n",
    "text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(\n",
    "    chunk_size=300, \n",
    "    chunk_overlap=50\n",
    ")\n",
    "splits = text_splitter.split_documents(blog_docs)\n",
    "\n",
    "# Index the chunks in a Chroma vector store\n",
    "vectorstore = Chroma.from_documents(documents=splits, \n",
    "                                    embedding=OpenAIEmbeddings())\n",
    "\n",
    "# Create our retriever\n",
    "retriever = vectorstore.as_retriever()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ee6529e4",
   "metadata": {},
   "source": [
    "Now, with our retriever ready, let’s explore our first query transformation technique."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0fa37693",
   "metadata": {},
   "source": [
    "<a id='part2-1'></a>\n",
    "## Multi-Query Generation\n",
    "\n",
    "A single user query represents just one perspective. Distance-based similarity search might miss relevant documents that use synonyms or discuss related concepts.\n",
    "\n",
    "The Multi-Query approach tackles this by using an LLM to generate several different versions of the user’s question, effectively searching from multiple angles.\n",
    "\n",
    "![Multi-Query Optimization (Created by Fareed Khan)](https://miro.medium.com/v2/resize:fit:1250/1*GjZoAISn6Jv3CBH87zUNPA.png)\n",
    "\n",
    "We’ll start by creating a prompt that instructs the LLM to generate these alternative questions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "57498a91",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.prompts import ChatPromptTemplate\n",
    "from langchain_core.output_parsers import StrOutputParser\n",
    "\n",
    "# Prompt for generating multiple queries\n",
    "template = \"\"\"You are an AI language model assistant. Your task is to generate five \n",
    "different versions of the given user question to retrieve relevant documents from a vector \n",
    "database. By generating multiple perspectives on the user question, your goal is to help\n",
    "the user overcome some of the limitations of the distance-based similarity search. \n",
    "Provide these alternative questions separated by newlines. Original question: {question}\"\"\"\n",
    "prompt_perspectives = ChatPromptTemplate.from_template(template)\n",
    "\n",
    "# Chain to generate the queries\n",
    "generate_queries = (\n",
    "    prompt_perspectives \n",
    "    | ChatOpenAI(temperature=0) \n",
    "    | StrOutputParser() \n",
    "    | (lambda x: [q.strip() for q in x.split(\"\\n\") if q.strip()])  # split into lines, dropping blanks\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "222c1244",
   "metadata": {},
   "source": [
    "Let’s test this chain and see what kind of queries it generates for our question."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "352fc1a8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1. How can LLM agents break down complex tasks?\n",
      "2. What is the process of task decomposition in the context of large language model agents?\n",
      "3. What are the methods for decomposing tasks for LLM-powered agents?\n",
      "4. Explain the concept of task decomposition as it applies to AI agents using LLMs.\n",
      "5. In what ways do LLM agents handle task decomposition?\n"
     ]
    }
   ],
   "source": [
    "question = \"What is task decomposition for LLM agents?\"\n",
    "generated_queries_list = generate_queries.invoke({\"question\": question})\n",
    "\n",
    "# Print the generated queries\n",
    "for i, q in enumerate(generated_queries_list):\n",
    "    print(f\"{i+1}. {q}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "984919ce",
   "metadata": {},
   "source": [
    "This is excellent. The LLM has rephrased our original question using different keywords like “break down complex tasks”, “methods”, and “process.” Now, we can retrieve documents for all of these queries and combine the results. A simple way to combine them is to take the unique set of all retrieved documents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "4d23d057",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total unique documents retrieved: 6\n"
     ]
    }
   ],
   "source": [
    "from langchain.load import dumps, loads\n",
    "\n",
    "def get_unique_union(documents: list[list]):\n",
    "    \"\"\" A simple function to get the unique union of retrieved documents \"\"\"\n",
    "    # Flatten the list of lists and convert each Document to a string for uniqueness\n",
    "    flattened_docs = [dumps(doc) for sublist in documents for doc in sublist]\n",
    "    unique_docs = list(set(flattened_docs))\n",
    "    return [loads(doc) for doc in unique_docs]\n",
    "\n",
    "# Build the retrieval chain\n",
    "retrieval_chain = generate_queries | retriever.map() | get_unique_union\n",
    "\n",
    "# Invoke the chain and check the number of documents retrieved\n",
    "docs = retrieval_chain.invoke({\"question\": question})\n",
    "print(f\"Total unique documents retrieved: {len(docs)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "00b0cba2",
   "metadata": {},
   "source": [
    "By searching with five different queries, we retrieved a total of 6 unique documents, likely capturing a more comprehensive set of information than a single query would have. Now we can feed this context into our final RAG chain."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "19ba58f7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Task decomposition for LLM agents involves breaking down large, complex tasks into smaller, more manageable sub-goals. This allows the agent to work through a problem systematically. Methods for decomposition include using the LLM itself with simple prompts (e.g., \"Steps for XYZ.\"), applying task-specific instructions, or incorporating human inputs to guide the process."
     ]
    }
   ],
   "source": [
    "from operator import itemgetter\n",
    "from langchain_core.runnables import RunnablePassthrough\n",
    "\n",
    "# The final RAG chain\n",
    "template = \"\"\"Answer the following question based on this context:\n",
    "\n",
    "{context}\n",
    "\n",
    "Question: {question}\n",
    "\"\"\"\n",
    "prompt = ChatPromptTemplate.from_template(template)\n",
    "llm = ChatOpenAI(temperature=0)\n",
    "\n",
    "final_rag_chain = (\n",
    "    {\"context\": retrieval_chain, \"question\": itemgetter(\"question\")} \n",
    "    | prompt\n",
    "    | llm\n",
    "    | StrOutputParser()\n",
    ")\n",
    "\n",
    "print(final_rag_chain.invoke({\"question\": question}))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ce069080",
   "metadata": {},
   "source": [
    "> This answer is more robust because it’s based on a wider pool of relevant documents."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cabfa157",
   "metadata": {},
   "source": [
    "<a id='part2-2'></a>\n",
    "## RAG-Fusion\n",
    "\n",
    "Multi-Query is a great start, but simply taking a union of documents treats them all equally. What if one document was ranked highly by three of our queries, while another was a low-ranked result from only one?\n",
    "\n",
    "![RAG Fusion (Created by Fareed Khan)](https://miro.medium.com/v2/resize:fit:1250/1*qIJlH2bVjc1ZZflcniuHCw.png)\n",
    "\n",
    "The first is clearly more important. RAG-Fusion improves on Multi-Query by not just fetching documents, but also …\n",
    "\n",
    "> **re-ranking** them using a technique called **Reciprocal Rank Fusion (RRF)**.\n",
    "\n",
    "RRF combines results from multiple searches by scoring each document with the reciprocal of its rank, `1 / (rank + k)`, in every result list where it appears and summing those scores. Documents that rank consistently high across lists accumulate the largest totals and move to the top.\n",
    "\n",
    "The code is very similar, but we’ll replace our `get_unique_union` function with an RRF implementation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "2c5475a3",
   "metadata": {},
   "outputs": [],
   "source": [
    "def reciprocal_rank_fusion(results: list[list], k=60):\n",
    "    \"\"\" Reciprocal Rank Fusion that intelligently combines multiple ranked lists \"\"\"\n",
    "    fused_scores = {}\n",
    "\n",
    "    # Iterate through each list of ranked documents\n",
    "    for docs in results:\n",
    "        for rank, doc in enumerate(docs):\n",
    "            doc_str = dumps(doc)\n",
    "            if doc_str not in fused_scores:\n",
    "                fused_scores[doc_str] = 0\n",
    "            # The core of RRF: documents ranked higher (lower rank value) get a larger score\n",
    "            fused_scores[doc_str] += 1 / (rank + k)\n",
    "\n",
    "    # Sort documents by their new fused scores in descending order\n",
    "    reranked_results = [\n",
    "        (loads(doc), score)\n",
    "        for doc, score in sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)\n",
    "    ]\n",
    "    return reranked_results"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1fdeaf42",
   "metadata": {},
   "source": [
    "The function above re-ranks documents after they are fetched through similarity search, but we haven’t used it in a retrieval chain yet, so let’s do that now."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "e180b3ef",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total re-ranked documents retrieved: 7\n"
     ]
    }
   ],
   "source": [
    "# Use a slightly different prompt for RAG-Fusion\n",
    "template = \"\"\"You are a helpful assistant that generates multiple search queries based on a single input query. \\n\n",
    "Generate multiple search queries related to: {question} \\n\n",
    "Output (4 queries):\"\"\"\n",
    "prompt_rag_fusion = ChatPromptTemplate.from_template(template)\n",
    "\n",
    "generate_queries = (\n",
    "    prompt_rag_fusion \n",
    "    | ChatOpenAI(temperature=0)\n",
    "    | StrOutputParser() \n",
    "    | (lambda x: [q.strip() for q in x.split(\"\\n\") if q.strip()])  # split into lines, dropping blanks\n",
    ")\n",
    "\n",
    "# Build the new retrieval chain with RRF\n",
    "retrieval_chain_rag_fusion = generate_queries | retriever.map() | reciprocal_rank_fusion\n",
    "docs = retrieval_chain_rag_fusion.invoke({\"question\": question})\n",
    "\n",
    "print(f\"Total re-ranked documents retrieved: {len(docs)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3ef3ed27",
   "metadata": {},
   "source": [
    "The final chain remains the same, but now it receives a more intelligently ranked context. RAG-Fusion is a powerful, low-effort way to increase the quality of your retrieval."
   ]
  },
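  {
   "cell_type": "markdown",
   "id": "a1d0c007",
   "metadata": {},
   "source": [
    "To make the RRF scoring concrete, here is a small worked example on plain strings instead of `Document` objects. The helper `rrf_scores` and the three ranked lists are made up for illustration; the scoring rule mirrors the `1 / (rank + k)` term used in `reciprocal_rank_fusion`, with 0-based ranks. `doc_a` should come out on top because it sits at or near the top of all three lists, while `doc_b` and `doc_d` each appear only once."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1d0c008",
   "metadata": {},
   "outputs": [],
   "source": [
    "def rrf_scores(ranked_lists, k=60):\n",
    "    # Sum 1 / (rank + k) over every list in which a document appears (rank is 0-based)\n",
    "    scores = {}\n",
    "    for ranked in ranked_lists:\n",
    "        for rank, doc_id in enumerate(ranked):\n",
    "            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (rank + k)\n",
    "    # Highest fused score first\n",
    "    return sorted(scores.items(), key=lambda item: item[1], reverse=True)\n",
    "\n",
    "# Hypothetical result lists, one per generated query, best match first\n",
    "toy_results = [\n",
    "    ['doc_a', 'doc_b', 'doc_c'],\n",
    "    ['doc_a', 'doc_c'],\n",
    "    ['doc_d', 'doc_a'],\n",
    "]\n",
    "\n",
    "for doc_id, score in rrf_scores(toy_results):\n",
    "    print(f'{doc_id}: {score:.4f}')"
   ]
  },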
  {
   "cell_type": "markdown",
   "id": "2c0aab1f",
   "metadata": {},
   "source": [
    "<a id='part2-3'></a>\n",
    "## Decomposition\n",
    "\n",
    "Some questions are too complex to be answered in a single step. For example, **“What are the main components of an LLM-powered agent, and how do they interact?”** This is really two questions in one.\n",
    "\n",
    "![Answer Recursively (Created by Fareed Khan)](https://miro.medium.com/v2/resize:fit:1250/1*oYttQUN_G0J_TZtigWjsGQ.png)\n",
    "\n",
    "The Decomposition technique uses an LLM to break down a complex query into a set of simpler, self-contained sub-questions. We can then answer each one and synthesize a final answer.\n",
    "\n",
    "We’ll start with a prompt designed for this purpose."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "7c1c260c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['1. What are the core components of a system that uses a large language model to power an autonomous agent?', '2. How is memory implemented in LLM-powered autonomous agents?', '3. What role does planning and task decomposition play in an autonomous agent system powered by LLMs?']\n"
     ]
    }
   ],
   "source": [
    "# Decomposition prompt\n",
    "template = \"\"\"You are a helpful assistant that generates multiple sub-questions related to an input question. \\n\n",
    "The goal is to break down the input into a set of sub-problems / sub-questions that can be answered in isolation. \\n\n",
    "Generate multiple search queries related to: {question} \\n\n",
    "Output (3 queries):\"\"\"\n",
    "prompt_decomposition = ChatPromptTemplate.from_template(template)\n",
    "\n",
    "# Chain to generate sub-questions\n",
    "generate_queries_decomposition = (\n",
    "    prompt_decomposition \n",
    "    | ChatOpenAI(temperature=0) \n",
    "    | StrOutputParser() \n",
    "    | (lambda x: x.split(\"\\n\"))\n",
    ")\n",
    "\n",
    "# Generate and print the sub-questions\n",
    "question = \"What are the main components of an LLM-powered autonomous agent system?\"\n",
    "sub_questions = generate_queries_decomposition.invoke({\"question\": question})\n",
    "print(sub_questions)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7e2693e4",
   "metadata": {},
   "source": [
    "The LLM successfully decomposed our complex question. Now, we can answer each of these individually and combine the results. One effective method is to answer each sub-question and use the resulting Q&A pairs as context to synthesize a final, comprehensive answer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "f3284577",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "An LLM-powered autonomous agent system primarily consists of three core components: planning, memory, and tool use. Planning involves decomposing large tasks into smaller, manageable sub-goals. Memory allows the agent to learn from past actions and retain information, using both short-term and long-term storage. Finally, tool use enables the agent to interact with external environments to gather information and perform actions beyond its inherent capabilities. These components work in concert to allow the agent to reason, plan, and execute complex tasks autonomously."
     ]
    }
   ],
   "source": [
    "# RAG prompt\n",
    "prompt_rag = hub.pull(\"rlm/rag-prompt\")\n",
    "\n",
    "# A list to hold the answers to our sub-questions\n",
    "rag_results = []\n",
    "for sub_question in sub_questions:\n",
    "    # Retrieve documents for each sub-question\n",
    "    retrieved_docs = retriever.invoke(sub_question)\n",
    "    \n",
    "    # Use our standard RAG chain to answer the sub-question\n",
    "    answer = (prompt_rag | llm | StrOutputParser()).invoke({\"context\": retrieved_docs, \"question\": sub_question})\n",
    "    rag_results.append(answer)\n",
    "\n",
    "def format_qa_pairs(questions, answers):\n",
    "    \"\"\"Format Q and A pairs\"\"\"\n",
    "    formatted_string = \"\"\n",
    "    for i, (question, answer) in enumerate(zip(questions, answers), start=1):\n",
    "        formatted_string += f\"Question {i}: {question}\\nAnswer {i}: {answer}\\n\\n\"\n",
    "    return formatted_string.strip()\n",
    "\n",
    "# Format the Q&A pairs into a single context string\n",
    "context = format_qa_pairs(sub_questions, rag_results)\n",
    "\n",
    "# Final synthesis prompt\n",
    "template = \"\"\"Here is a set of Q+A pairs:\n",
    "\n",
    "{context}\n",
    "\n",
    "Use these to synthesize an answer to the original question: {question}\n",
    "\"\"\"\n",
    "prompt = ChatPromptTemplate.from_template(template)\n",
    "\n",
    "final_rag_chain = (\n",
    "    prompt\n",
    "    | llm\n",
    "    | StrOutputParser()\n",
    ")\n",
    "\n",
    "print(final_rag_chain.invoke({\"context\": context, \"question\": question}))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3e769f2",
   "metadata": {},
   "source": [
    "By breaking the problem down, we constructed a much more detailed and structured answer than we would have otherwise."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8e81e22a",
   "metadata": {},
   "source": [
    "<a id='part2-4'></a>\n",
    "## Step-Back Prompting\n",
    "\n",
    "Sometimes, a user’s query is too specific, while our documents contain the more general, underlying information needed to answer it.\n",
    "\n",
    "![Step Back Prompting (Created by Fareed Khan)](https://miro.medium.com/v2/resize:fit:875/1*6lrhGv1fdcmLKMVu5tU3uQ.png)\n",
    "\n",
    "> For example, a user might ask, “Could the members of The Police perform lawful arrests?”\n",
    "\n",
    "A direct search for this might fail. The Step-Back technique uses an LLM to take a “step back” and form a more general question, like “What can the members of The Police do?” We then retrieve context for *both* the specific and general questions, providing a richer context for the final answer.\n",
    "\n",
    "We can teach the LLM this pattern using few-shot examples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "1825be60",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_core.prompts import ChatPromptTemplate, FewShotChatMessagePromptTemplate\n",
    "\n",
    "# Few-shot examples to teach the model how to generate step-back (more generic) questions\n",
    "examples = [\n",
    "    {\n",
    "        \"input\": \"Could the members of The Police perform lawful arrests?\",\n",
    "        \"output\": \"what can the members of The Police do?\",\n",
    "    },\n",
    "    {\n",
    "        \"input\": \"Jan Sindel was born in what country?\",\n",
    "        \"output\": \"what is Jan Sindel's personal history?\",\n",
    "    },\n",
    "]\n",
    "\n",
    "# Define how each example is formatted in the prompt\n",
    "example_prompt = ChatPromptTemplate.from_messages([\n",
    "    (\"human\", \"{input}\"),  # User input\n",
    "    (\"ai\", \"{output}\")     # Model's response\n",
    "])\n",
    "\n",
    "# Wrap the few-shot examples into a reusable prompt template\n",
    "few_shot_prompt = FewShotChatMessagePromptTemplate(\n",
    "    example_prompt=example_prompt,\n",
    "    examples=examples,\n",
    ")\n",
    "\n",
    "# Full prompt includes system instruction, few-shot examples, and the user question\n",
    "prompt = ChatPromptTemplate.from_messages([\n",
    "    (\"system\", \n",
    "     \"You are an expert at world knowledge. Your task is to step back and paraphrase a question \"\n",
    "     \"to a more generic step-back question, which is easier to answer. Here are a few examples:\"),\n",
    "    few_shot_prompt,\n",
    "    (\"user\", \"{question}\"),\n",
    "])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "154ea14b",
   "metadata": {},
   "source": [
    "Now we can define the chain for the step-back approach."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "f7b3610f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original Question: What is task decomposition for LLM agents?\n",
      "Step-Back Question: What are the different approaches to task decomposition in software engineering?\n"
     ]
    }
   ],
   "source": [
    "# Define a chain to generate step-back questions using the prompt and an OpenAI model\n",
    "generate_queries_step_back = prompt | ChatOpenAI(temperature=0) | StrOutputParser()\n",
    "\n",
    "# Run the chain on a specific question\n",
    "question = \"What is task decomposition for LLM agents?\"\n",
    "step_back_question = generate_queries_step_back.invoke({\"question\": question})\n",
    "\n",
    "# Output the original and generated step-back question\n",
    "print(f\"Original Question: {question}\")\n",
    "print(f\"Step-Back Question: {step_back_question}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "af9a3c61",
   "metadata": {},
   "source": [
    "This is a useful step-back question. It broadens the scope to general software engineering, which will likely pull in foundational documents that can then be combined with the specific context about LLM agents. Now we can build a chain that uses both."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "7e13c1b6",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_core.runnables import RunnableLambda\n",
    "\n",
    "# Prompt for the final response\n",
    "response_prompt_template = \"\"\"You are an expert of world knowledge. I am going to ask you a question. Your response should be comprehensive and should not contradict the following context if it is relevant. Otherwise, ignore the context if it is not relevant.\n",
    "\n",
    "# Normal Context\n",
    "{normal_context}\n",
    "\n",
    "# Step-Back Context\n",
    "{step_back_context}\n",
    "\n",
    "# Original Question: {question}\n",
    "# Answer:\"\"\"\n",
    "response_prompt = ChatPromptTemplate.from_template(response_prompt_template)\n",
    "\n",
    "# The full chain\n",
    "chain = (\n",
    "    {\n",
    "        # Retrieve context using the normal question\n",
    "        \"normal_context\": RunnableLambda(lambda x: x[\"question\"]) | retriever,\n",
    "        # Retrieve context using the step-back question\n",
    "        \"step_back_context\": generate_queries_step_back | retriever,\n",
    "        # Pass on the original question\n",
    "        \"question\": lambda x: x[\"question\"],\n",
    "    }\n",
    "    | response_prompt\n",
    "    | ChatOpenAI(temperature=0)\n",
    "    | StrOutputParser()\n",
    ")\n",
    "\n",
    "response = chain.invoke({\"question\": question})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bc7b8dfe",
   "metadata": {},
   "source": [
    "Running the step-back chain on our query produces the following output."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "c551aa50",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Task decomposition is a fundamental concept in software engineering where a complex problem is broken down into smaller, more manageable parts. In the context of LLM agents, this principle is applied to enable them to handle large tasks. By decomposing a task into sub-goals, the agent can plan and execute a series of simpler actions. This can be achieved through various methods, such as using the LLM itself to generate a step-by-step plan, following task-specific instructions, or by taking input from a human operator."
     ]
    }
   ],
   "source": [
    "print(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "83694b96",
   "metadata": {},
   "source": [
    "<a id='part2-5'></a>\n",
    "## HyDE\n",
    "\n",
    "This final technique is one of the most clever. The core problem of retrieval is that a user’s query might use different words than the document (the “vocabulary mismatch” problem).\n",
    "\n",
    "![HyDE (Created by Fareed Khan)](https://miro.medium.com/v2/resize:fit:1250/1*YQVJMOpDBU6l54atHoFpJg.png)\n",
    "\n",
    "**HyDE (Hypothetical Document Embeddings)** proposes a radical solution: first, have an LLM generate a *hypothetical* answer to the question. This fake document may not be factually correct, but it will be semantically rich and use the kind of language we expect to find in a real answer.\n",
    "\n",
    "We then embed this hypothetical document and use its embedding to perform the retrieval. The result is that we find real documents that are semantically very similar to an ideal answer.\n",
    "\n",
    "Let’s start by creating a prompt to generate this hypothetical document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "e023ddf1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Task decomposition in large language model (LLM) agents refers to the process of breaking down a complex, high-level task into a series of smaller, more manageable sub-tasks. This hierarchical approach is crucial for enabling agents to handle sophisticated goals that require multi-step reasoning and planning. The decomposition can be achieved through several mechanisms, including programmatic scripts, interaction with external tools, or recursive calls to the LLM itself with structured prompts. By dividing the problem space, the agent can focus on solving one sub-problem at a time, using the output of one step as the input for the next, thus creating a coherent and executable workflow.\n"
     ]
    }
   ],
   "source": [
    "# HyDE prompt\n",
    "template = \"\"\"Please write a scientific paper passage to answer the question\n",
    "Question: {question}\n",
    "Passage:\"\"\"\n",
    "prompt_hyde = ChatPromptTemplate.from_template(template)\n",
    "\n",
    "# Chain to generate the hypothetical document\n",
    "generate_docs_for_retrieval = (\n",
    "    prompt_hyde \n",
    "    | ChatOpenAI(temperature=0) \n",
    "    | StrOutputParser() \n",
    ")\n",
    "\n",
    "# Generate and print the hypothetical document\n",
    "hypothetical_document = generate_docs_for_retrieval.invoke({\"question\": question})\n",
    "print(hypothetical_document)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "57de5e21",
   "metadata": {},
   "source": [
    "This passage reads like a textbook-style answer, even though nothing in it has been checked against our documents. Now, we use its embedding to find real documents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "e77b8fa0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Task decomposition for LLM agents involves breaking down a larger task into smaller, more manageable subgoals. This can be done using techniques like Chain of Thought (CoT), which prompts the model for step-by-step thinking, or Tree of Thoughts, which explores multiple reasoning paths. The decomposition can be driven by the LLM itself through simple prompting, by using task-specific instructions, or by incorporating human inputs."
     ]
    }
   ],
   "source": [
    "# Retrieve documents using the HyDE approach\n",
    "retrieval_chain = generate_docs_for_retrieval | retriever \n",
    "retrieved_docs = retrieval_chain.invoke({\"question\": question})\n",
    "\n",
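    "# Answer from the retrieved context with the standard RAG prompt (prompt_rag, pulled earlier);\n",
    "# final_rag_chain was built around the Q+A synthesis prompt in the Decomposition section\n",
    "response = (prompt_rag | llm | StrOutputParser()).invoke({\"context\": retrieved_docs, \"question\": question})\n",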
    "print(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19affa43",
   "metadata": {},
   "source": [
    "By using a hypothetical document as a **lure**, HyDE helped us zero in on the most relevant chunks in our knowledge base, demonstrating another powerful tool in our RAG toolkit."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9237d2ab",
   "metadata": {},
   "source": [
    "<a id='part3'></a>\n",
    "# Routing & Query Construction\n",
    "\n",
    "Our RAG system is getting smarter, but in a real-world scenario, knowledge isn’t stored in a single, uniform library.\n",
    "\n",
    "> We often have multiple data sources: documentation for different programming languages, internal wikis, public websites, or databases with structured metadata.\n",
    "\n",
    "![Routing and Query Transformation (Created by Fareed Khan)](https://miro.medium.com/v2/resize:fit:1250/1*cost0_AWB8NKp0WxZlH7fA.png)\n",
    "\n",
    "Sending every query to every source is wildly inefficient and can lead to noisy, irrelevant results.\n",
    "\n",
    "This is where our RAG system needs to evolve from a simple librarian into an **intelligent switchboard operator**. It needs the ability to first *analyze* an incoming query and then *route* it to the correct destination or *construct* a more precise, structured query for retrieval. This section dives into the techniques that make this possible."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "67e24fb2",
   "metadata": {},
   "source": [
    "<a id='part3-1'></a>\n",
    "## Logical Routing\n",
    "\n",
    "Routing is a classification problem. Given a user’s question, we need to classify it into one of several predefined categories. While traditional ML models can do this, we can leverage the powerful reasoning engine we already have: the LLM itself.\n",
    "\n",
    "![Logical Routing (Created by Fareed Khan)](https://miro.medium.com/v2/resize:fit:875/1*PK9xKW0o-72xmmLaAAozeA.png)\n",
    "\n",
    "By providing the LLM with a clear schema (a set of possible categories), we can ask it to make the classification decision for us.\n",
    "\n",
    "We’ll start by defining the “contract” for our LLM’s output using a Pydantic model. This schema explicitly tells the LLM the possible destinations for a query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "d53c165b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import Literal\n",
    "from langchain_core.pydantic_v1 import BaseModel, Field\n",
    "\n",
    "# Define the data model for our router's output\n",
    "class RouteQuery(BaseModel):\n",
    "    \"\"\"A data model to route a user query to the most relevant datasource.\"\"\"\n",
    "\n",
    "    # The 'datasource' field must be one of the three specified literal strings.\n",
    "    # This enforces a strict set of choices for the LLM.\n",
    "    datasource: Literal[\"python_docs\", \"js_docs\", \"golang_docs\"] = Field(\n",
    "        ...,  # The '...' indicates that this field is required.\n",
    "        description=\"Given a user question, choose which datasource would be most relevant for answering their question.\",\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8e1fab2c",
   "metadata": {},
   "source": [
    "With our schema defined, we can now build the router chain. We’ll use a prompt to give the LLM its instructions and then use the `.with_structured_output()` method so that its response is parsed into our `RouteQuery` model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "1b663642",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_openai import ChatOpenAI\n",
    "\n",
    "# Initialize our LLM\n",
    "llm = ChatOpenAI(model=\"gpt-3.5-turbo-0125\", temperature=0)\n",
    "\n",
    "# Create a new LLM instance that is \"structured\" to output our Pydantic model\n",
    "structured_llm = llm.with_structured_output(RouteQuery)\n",
    "\n",
    "# The system prompt provides the core instruction for the LLM's task.\n",
    "system = \"\"\"You are an expert at routing a user question to the appropriate data source.\n",
    "\n",
    "Based on the programming language the question is referring to, route it to the relevant data source.\"\"\"\n",
    "\n",
    "# The full prompt template combines the system message and the user's question.\n",
    "prompt = ChatPromptTemplate.from_messages(\n",
    "    [\n",
    "        (\"system\", system),\n",
    "        (\"human\", \"{question}\"),\n",
    "    ]\n",
    ")\n",
    "\n",
    "# Define the complete router chain\n",
    "router = prompt | structured_llm"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "50ae15a4",
   "metadata": {},
   "source": [
    "Now, let’s test our router. We’ll pass it a question that is clearly about Python and inspect the output."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "2bef3ef0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "datasource='python_docs'\n"
     ]
    }
   ],
   "source": [
    "question = \"\"\"Why doesn't the following code work:\n",
    "\n",
    "from langchain_core.prompts import ChatPromptTemplate\n",
    "\n",
    "prompt = ChatPromptTemplate.from_messages([\"human\", \"speak in {language}\"])\n",
    "prompt.invoke(\"french\")\n",
    "\"\"\"\n",
    "\n",
    "# Invoke the router and check the result\n",
    "result = router.invoke({\"question\": question})\n",
    "\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e877b70f",
   "metadata": {},
   "source": [
    "The output is an instance of our `RouteQuery` model, and the LLM has correctly identified `python_docs` as the appropriate datasource. This structured output is now something we can reliably use in our code to implement branching logic."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "d5110a11",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "chain for python_docs\n"
     ]
    }
   ],
   "source": [
    "from langchain_core.runnables import RunnableLambda\n",
    "\n",
    "def choose_route(result):\n",
    "    \"\"\"A function to determine the downstream logic based on the router's output.\"\"\"\n",
    "    if \"python_docs\" in result.datasource.lower():\n",
    "        # In a real app, this would be a complete RAG chain for Python docs\n",
    "        return \"chain for python_docs\"\n",
    "    elif \"js_docs\" in result.datasource.lower():\n",
    "        # This would be the chain for JavaScript docs\n",
    "        return \"chain for js_docs\"\n",
    "    else:\n",
    "        # And this for Go docs\n",
    "        return \"chain for golang_docs\"\n",
    "\n",
    "# The full chain now includes the routing and branching logic\n",
    "full_chain = router | RunnableLambda(choose_route)\n",
    "\n",
    "# Let's run the full chain\n",
    "final_destination = full_chain.invoke({\"question\": question})\n",
    "\n",
    "print(final_destination)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ecbd91ba",
   "metadata": {},
   "source": [
    "Our switchboard correctly routed the Python-related query. This approach is incredibly powerful for building multi-source RAG systems."
   ]
  },
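  {
   "cell_type": "markdown",
   "id": "route-dispatch-sketch-md",
   "metadata": {},
   "source": [
    "One detail of the `if`/`elif` branching above is that any unrecognized value falls through to the Go branch. As a possible alternative, the sketch below uses a dictionary lookup that fails loudly on unknown routes. `ToyRoute`, `ROUTE_HANDLERS`, and `dispatch` are hypothetical names for illustration only; `ToyRoute` stands in for the router’s structured output so the cell runs without an LLM call."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "route-dispatch-sketch-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "from dataclasses import dataclass\n",
    "\n",
    "@dataclass\n",
    "class ToyRoute:\n",
    "    \"\"\"Hypothetical stand-in for the router's structured output.\"\"\"\n",
    "    datasource: str\n",
    "\n",
    "# Hypothetical handlers; in a real app each would be a full RAG chain\n",
    "ROUTE_HANDLERS = {\n",
    "    \"python_docs\": lambda q: f\"[python_docs chain] {q}\",\n",
    "    \"js_docs\": lambda q: f\"[js_docs chain] {q}\",\n",
    "    \"golang_docs\": lambda q: f\"[golang_docs chain] {q}\",\n",
    "}\n",
    "\n",
    "def dispatch(route, question):\n",
    "    \"\"\"Look up the handler for the routed datasource; raise on unknown values.\"\"\"\n",
    "    handler = ROUTE_HANDLERS.get(route.datasource)\n",
    "    if handler is None:\n",
    "        raise ValueError(f\"Unknown datasource: {route.datasource!r}\")\n",
    "    return handler(question)\n",
    "\n",
    "print(dispatch(ToyRoute(\"js_docs\"), \"How do I debounce a function?\"))"
   ]
  },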
  {
   "cell_type": "markdown",
   "id": "f9e018b4",
   "metadata": {},
   "source": [
    "<a id='part3-2'></a>\n",
    "## Semantic Routing\n",
    "\n",
    "Logical routing works well when you have clearly defined categories. But what if you want to route based on the *style* or *domain* of a question? For example, you might want to answer physics questions with a serious, academic tone and math questions with a step-by-step, pedagogical approach. This is where **Semantic Routing** comes in.\n",
    "\n",
    "![Semantic Routing (Created by Fareed Khan)](https://miro.medium.com/v2/resize:fit:1250/1*mzz-ncmrzdwQU37GFgPeTw.png)\n",
    "\n",
    "> Instead of classifying the query, we define multiple expert prompts.\n",
    "\n",
    "We then embed the user’s query and each of our prompt templates, and use cosine similarity to find the prompt that is most semantically aligned with the query.\n",
    "\n",
    "First, let’s define our two expert personas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "d81b5b2f",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_core.prompts import PromptTemplate\n",
    "\n",
    "# A prompt for a physics expert\n",
    "physics_template = \"\"\"You are a very smart physics professor. \\\n",
    "You are great at answering questions about physics in a concise and easy to understand manner. \\\n",
    "When you don't know the answer to a question you admit that you don't know.\n",
    "\n",
    "Here is a question:\n",
    "{query}\"\"\"\n",
    "\n",
    "# A prompt for a math expert\n",
    "math_template = \"\"\"You are a very good mathematician. You are great at answering math questions. \\\n",
    "You are so good because you are able to break down hard problems into their component parts, \\\n",
    "answer the component parts, and then put them together to answer the broader question.\n",
    "\n",
    "Here is a question:\n",
    "{query}\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "919b96fa",
   "metadata": {},
   "source": [
    "Now, we’ll create the routing function that performs the embedding and similarity comparison."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "5b74e86f",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.utils.math import cosine_similarity\n",
    "from langchain_openai import OpenAIEmbeddings\n",
    "\n",
    "# Initialize the embedding model\n",
    "embeddings = OpenAIEmbeddings()\n",
    "\n",
    "# Store our templates and their embeddings for comparison\n",
    "prompt_templates = [physics_template, math_template]\n",
    "prompt_embeddings = embeddings.embed_documents(prompt_templates)\n",
    "\n",
    "def prompt_router(input):\n",
    "    \"\"\"A function to route the input query to the most similar prompt template.\"\"\"\n",
    "    # 1. Embed the incoming user query\n",
    "    query_embedding = embeddings.embed_query(input[\"query\"])\n",
    "    \n",
    "    # 2. Compute the cosine similarity between the query and all prompt templates\n",
    "    similarity = cosine_similarity([query_embedding], prompt_embeddings)[0]\n",
    "    \n",
    "    # 3. Find the index of the most similar prompt\n",
    "    most_similar_index = similarity.argmax()\n",
    "    \n",
    "    # 4. Select the most similar prompt template\n",
    "    chosen_prompt = prompt_templates[most_similar_index]\n",
    "    \n",
    "    print(f\"DEBUG: Using {'MATH' if most_similar_index == 1 else 'PHYSICS'} template.\")\n",
    "    \n",
    "    # 5. Return the chosen prompt object\n",
    "    return PromptTemplate.from_template(chosen_prompt)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "890796f3",
   "metadata": {},
   "source": [
    "With the routing logic in place, we can build the full chain that dynamically selects the right expert for the job."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "6947fcf3",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "DEBUG: Using PHYSICS template.\n",
      "A black hole is a region of spacetime where gravity is so strong that nothing—no particles or even electromagnetic radiation such as light—can escape from it. The boundary of no escape is called the event horizon. Although it has a great effect on the fate and circumstances of an object crossing it, it has no locally detectable features. In many ways, a black hole acts as an ideal black body, as it reflects no light.\n"
     ]
    }
   ],
   "source": [
    "# The final chain that combines the router with the LLM\n",
    "chain = (\n",
    "    {\"query\": RunnablePassthrough()}\n",
    "    | RunnableLambda(prompt_router)  # Dynamically select the prompt\n",
    "    | ChatOpenAI()\n",
    "    | StrOutputParser()\n",
    ")\n",
    "\n",
    "# Ask a physics question\n",
    "print(chain.invoke(\"What's a black hole\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d18bd79d",
   "metadata": {},
   "source": [
    "Perfect. The router correctly identified the question as physics-related and used the physics professor prompt, resulting in a concise and accurate answer. This technique is excellent for creating specialized agents that adapt their persona to the user’s needs."
   ]
  },
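  {
   "cell_type": "markdown",
   "id": "cosine-routing-sketch-md",
   "metadata": {},
   "source": [
    "To see what the similarity comparison is doing, here is a small sketch that routes with hand-written vectors instead of real embeddings. The three-dimensional vectors and the names `cosine`, `toy_prompt_vectors`, and `toy_query_vector` are made up for illustration; real embeddings have far more dimensions, but the arg-max over cosine scores works the same way."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cosine-routing-sketch-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "def cosine(a, b):\n",
    "    \"\"\"Cosine similarity between two equal-length, non-zero vectors.\"\"\"\n",
    "    dot = sum(x * y for x, y in zip(a, b))\n",
    "    norm_a = math.sqrt(sum(x * x for x in a))\n",
    "    norm_b = math.sqrt(sum(y * y for y in b))\n",
    "    return dot / (norm_a * norm_b)\n",
    "\n",
    "# Hypothetical stand-ins for the embedded prompt templates\n",
    "toy_prompt_vectors = {\n",
    "    \"physics\": [0.9, 0.1, 0.2],\n",
    "    \"math\": [0.1, 0.9, 0.3],\n",
    "}\n",
    "\n",
    "# Hypothetical stand-in for the embedded user query\n",
    "toy_query_vector = [0.8, 0.2, 0.1]\n",
    "\n",
    "# Score the query against every prompt and keep the best match\n",
    "toy_scores = {name: cosine(toy_query_vector, vec) for name, vec in toy_prompt_vectors.items()}\n",
    "toy_choice = max(toy_scores, key=toy_scores.get)\n",
    "\n",
    "print(toy_scores)\n",
    "print(f\"Chosen template: {toy_choice}\")"
   ]
  },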
  {
   "cell_type": "markdown",
   "id": "a1a1954f",
   "metadata": {},
   "source": [
    "<a id='part3-3'></a>\n",
    "## Query Structuring\n",
    "\n",
    "So far, we’ve focused on retrieving from unstructured text. But much real-world data is *semi-structured*: it contains valuable metadata like dates, authors, view counts, or categories. A plain vector search can’t leverage this information.\n",
    "\n",
    "> **Query Structuring** is the technique of converting a natural language question into a structured query that can use these metadata filters for highly precise retrieval.\n",
    "\n",
    "To illustrate, let’s look at the metadata available from a YouTube video transcript."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "6d018d5f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'source': 'pbAd8O1Lvm4', 'title': 'Self-reflective RAG with LangGraph: Self-RAG and CRAG', 'description': 'Unknown', 'view_count': 11922, 'thumbnail_url': 'https://i.ytimg.com/vi/pbAd8O1Lvm4/hq720.jpg', 'publish_date': '2024-02-07 00:00:00', 'length': 1058, 'author': 'LangChain'}\n"
     ]
    }
   ],
   "source": [
    "from langchain_community.document_loaders import YoutubeLoader\n",
    "\n",
    "# Load a YouTube transcript to inspect its metadata\n",
    "docs = YoutubeLoader.from_youtube_url(\n",
    "    \"https://www.youtube.com/watch?v=pbAd8O1Lvm4\", add_video_info=True\n",
    ").load()\n",
    "\n",
    "# Print the metadata of the first document\n",
    "print(docs[0].metadata)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7059f297",
   "metadata": {},
   "source": [
    "This document has rich metadata: `view_count`, `publish_date`, `length`. We want our users to be able to filter on these fields using natural language. To do this, we'll define another Pydantic schema, this time for a structured video search query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "ed382555",
   "metadata": {},
   "outputs": [],
   "source": [
    "import datetime\n",
    "from typing import Optional\n",
    "\n",
    "class TutorialSearch(BaseModel):\n",
    "    \"\"\"A data model for searching over a database of tutorial videos.\"\"\"\n",
    "\n",
    "    # The main query for a similarity search over the video's transcript.\n",
    "    content_search: str = Field(..., description=\"Similarity search query applied to video transcripts.\")\n",
    "    \n",
    "    # A more succinct query for searching just the video's title.\n",
    "    title_search: str = Field(..., description=\"Alternate version of the content search query to apply to video titles.\")\n",
    "    \n",
    "    # Optional metadata filters\n",
    "    min_view_count: Optional[int] = Field(None, description=\"Minimum view count filter, inclusive.\")\n",
    "    max_view_count: Optional[int] = Field(None, description=\"Maximum view count filter, exclusive.\")\n",
    "    earliest_publish_date: Optional[datetime.date] = Field(None, description=\"Earliest publish date filter, inclusive.\")\n",
    "    latest_publish_date: Optional[datetime.date] = Field(None, description=\"Latest publish date filter, exclusive.\")\n",
    "    min_length_sec: Optional[int] = Field(None, description=\"Minimum video length in seconds, inclusive.\")\n",
    "    max_length_sec: Optional[int] = Field(None, description=\"Maximum video length in seconds, exclusive.\")\n",
    "\n",
    "    def pretty_print(self) -> None:\n",
    "        \"\"\"A helper function to print the populated fields of the model.\"\"\"\n",
    "        for field in self.__fields__:\n",
    "            if getattr(self, field) is not None:\n",
    "                print(f\"{field}: {getattr(self, field)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c417ddce",
   "metadata": {},
   "source": [
    "This schema is our target. We’ll now create a chain that takes a user question and fills out this model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "36695efb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# System prompt for the query analyzer\n",
    "system = \"\"\"You are an expert at converting user questions into database queries. \\\n",
    "You have access to a database of tutorial videos about a software library for building LLM-powered applications. \\\n",
    "Given a question, return a database query optimized to retrieve the most relevant results.\n",
    "\n",
    "If there are acronyms or words you are not familiar with, do not try to rephrase them.\"\"\"\n",
    "\n",
    "prompt = ChatPromptTemplate.from_messages([(\"system\", system), (\"human\", \"{question}\")])\n",
    "structured_llm = llm.with_structured_output(TutorialSearch)\n",
    "\n",
    "# The final query analyzer chain\n",
    "query_analyzer = prompt | structured_llm"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "46bd8756",
   "metadata": {},
   "source": [
    "Let’s test this with a few different questions to see its power."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "id": "48323354",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "content_search: rag from scratch\n",
      "title_search: rag from scratch\n"
     ]
    }
   ],
   "source": [
    "# Test 1: A simple query\n",
    "query_analyzer.invoke({\"question\": \"rag from scratch\"}).pretty_print()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fa9f2386",
   "metadata": {},
   "source": [
    "As expected, it fills the content and title search fields. Now for a more complex query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "3eea4690",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "content_search: chat langchain\n",
      "title_search: chat langchain 2023\n",
      "earliest_publish_date: 2023-01-01\n",
      "latest_publish_date: 2024-01-01\n"
     ]
    }
   ],
   "source": [
    "# Test 2: A query with a date filter\n",
    "query_analyzer.invoke(\n",
    "    {\"question\": \"videos on chat langchain published in 2023\"}\n",
    ").pretty_print()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b6dc82f8",
   "metadata": {},
   "source": [
    "This is brilliant. The LLM correctly interpreted “in 2023” and created a date range filter. Let’s try one more with a time constraint."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "id": "4e4bdf1d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "content_search: multi-modal models agent\n",
      "title_search: multi-modal models agent\n",
      "max_length_sec: 300\n"
     ]
    }
   ],
   "source": [
    "# Test 3: A query with a length filter\n",
    "query_analyzer.invoke(\n",
    "    {\n",
    "        \"question\": \"how to use multi-modal models in an agent, only videos under 5 minutes\"\n",
    "    }\n",
    ").pretty_print()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f52b0510",
   "metadata": {},
   "source": [
    "It perfectly converted “under 5 minutes” to `max_length_sec: 300`. This structured query can now be passed to a vector store that supports metadata filtering, allowing for incredibly precise and efficient retrieval that goes far beyond simple semantic search."
   ]
  },
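  {
   "cell_type": "markdown",
   "id": "c7e41a90",
   "metadata": {},
   "source": [
    "How a structured query becomes a filter depends on the vector store. As a minimal sketch, assume a store that accepts Mongo-style operators (`$gte`, `$lt`, `$and`; Chroma's `where` filters, for example, use this style) and assume `publish_date` is stored as a `YYYYMMDD` integer. Under those assumptions, a hypothetical helper `to_metadata_filter` could map the populated fields to a filter dictionary. The helper name, the field mapping, and the date encoding are illustrative and not part of any library.\n",
    "\n",
    "```python\n",
    "import datetime\n",
    "from types import SimpleNamespace\n",
    "\n",
    "# Hypothetical mapping: (query field, metadata key, comparison operator).\n",
    "# Minimums are inclusive ($gte) and maximums exclusive ($lt), matching the schema.\n",
    "FILTER_SPEC = [\n",
    "    (\"min_view_count\", \"view_count\", \"$gte\"),\n",
    "    (\"max_view_count\", \"view_count\", \"$lt\"),\n",
    "    (\"earliest_publish_date\", \"publish_date\", \"$gte\"),\n",
    "    (\"latest_publish_date\", \"publish_date\", \"$lt\"),\n",
    "    (\"min_length_sec\", \"length\", \"$gte\"),\n",
    "    (\"max_length_sec\", \"length\", \"$lt\"),\n",
    "]\n",
    "\n",
    "def to_metadata_filter(search):\n",
    "    \"\"\"Build a Mongo-style filter dict from the populated fields of a search object.\"\"\"\n",
    "    clauses = []\n",
    "    for attr, key, op in FILTER_SPEC:\n",
    "        value = getattr(search, attr, None)\n",
    "        if value is None:\n",
    "            continue\n",
    "        # Assumption: dates are stored as integers of the form YYYYMMDD\n",
    "        if isinstance(value, datetime.date):\n",
    "            value = int(value.strftime(\"%Y%m%d\"))\n",
    "        clauses.append({key: {op: value}})\n",
    "    if not clauses:\n",
    "        return None  # no metadata filter requested\n",
    "    if len(clauses) == 1:\n",
    "        return clauses[0]  # a single condition needs no $and wrapper\n",
    "    return {\"$and\": clauses}\n",
    "\n",
    "# Stand-in for a populated TutorialSearch object\n",
    "example = SimpleNamespace(\n",
    "    max_length_sec=300,\n",
    "    earliest_publish_date=datetime.date(2023, 1, 1),\n",
    ")\n",
    "print(to_metadata_filter(example))\n",
    "```\n",
    "\n",
    "The resulting dictionary would be passed alongside `content_search` when querying the store; check your store's documentation for the exact filter syntax it supports."
   ]
  },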
  {
   "cell_type": "markdown",
   "id": "ffa99ac0",
   "metadata": {},
   "source": [
    "<a id='part4'></a>\n",
    "# Advanced Indexing Strategies\n",
    "\n",
    "So far, our approach to indexing has been straightforward: split documents into chunks and embed them. This works, but it has a fundamental limitation.\n",
    "\n",
    "Small, focused chunks are great for retrieval accuracy (they contain less noise), but they often lack the broader context needed for the LLM to generate a comprehensive answer.\n",
    "\n",
    "![Indexing Strategies (Created by Fareed Khan)](https://miro.medium.com/v2/resize:fit:1250/1*PrdpYBmw3-ln5AaZLjyUaw.png)\n",
    "\n",
    "Conversely, large chunks provide great context but perform poorly in retrieval because their core meaning gets diluted.\n",
    "\n",
    "> This is the classic “chunk size” dilemma. How can we get the best of both worlds?\n",
    "\n",
    "The answer lies in more advanced indexing strategies that separate the document representation used for *retrieval* from the one used for *generation*. Let’s dive in."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b62b2c26",
   "metadata": {},
   "source": [
    "<a id='part4-1'></a>\n",
    "## Multi-Representation Indexing\n",
    "\n",
    "The core idea of Multi-Representation Indexing is simple but powerful: instead of embedding the full document chunks, we create a smaller, more focused representation of each chunk (like a summary) and embed *that* instead.\n",
    "\n",
    "![Multi Representation Indexing (Created by Fareed Khan)](https://miro.medium.com/v2/resize:fit:1250/1*1TbTDTSvgVbpKxSW7feMng.png)\n",
    "\n",
    "During retrieval, we search over these concise summaries. Once we find the best summary, we use its ID to look up and retrieve the full, original document chunk.\n",
    "\n",
    "This way, we get the precision of searching over small, dense summaries and the rich context of the larger parent documents for generation.\n",
    "\n",
    "First, we need to load some documents to work with. We’ll grab two posts from Lilian Weng’s blog."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "id": "4be8076b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loaded 2 documents.\n"
     ]
    }
   ],
   "source": [
    "from langchain_community.document_loaders import WebBaseLoader\n",
    "\n",
    "# Load two different blog posts to create a more diverse knowledge base\n",
    "loader = WebBaseLoader(\"https://lilianweng.github.io/posts/2023-06-23-agent/\")\n",
    "docs = loader.load()\n",
    "\n",
    "loader = WebBaseLoader(\"https://lilianweng.github.io/posts/2024-02-05-human-data-quality/\")\n",
    "docs.extend(loader.load())\n",
    "\n",
    "print(f\"Loaded {len(docs)} documents.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1d2551c2",
   "metadata": {},
   "source": [
    "Next, we’ll create a chain to generate a summary for each of these documents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "id": "ed406fc5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The document discusses building autonomous agents powered by Large Language Models (LLMs). It outlines the key components of such a system, including planning, memory, and tool use. The author explores challenges like the finite context length of LLMs, the difficulty in long-term planning, and the reliability of natural language interfaces. Case studies like AutoGPT and GPT-Engineer are presented as proof-of-concept examples, and the post concludes with a list of references to relevant research papers.\n"
     ]
    }
   ],
   "source": [
    "import uuid\n",
    "from langchain_core.output_parsers import StrOutputParser\n",
    "from langchain_core.prompts import ChatPromptTemplate\n",
    "\n",
    "# The chain for generating summaries\n",
    "summary_chain = (\n",
    "    # Extract the page_content from the document object\n",
    "    {\"doc\": lambda x: x.page_content}\n",
    "    # Pipe it into a prompt template\n",
    "    | ChatPromptTemplate.from_template(\"Summarize the following document:\\n\\n{doc}\")\n",
    "    # Use an LLM to generate the summary\n",
    "    | ChatOpenAI(model=\"gpt-3.5-turbo\", max_retries=0)\n",
    "    # Parse the output into a string\n",
    "    | StrOutputParser()\n",
    ")\n",
    "\n",
    "# Use .batch() to run the summarization in parallel for efficiency\n",
    "summaries = summary_chain.batch(docs, {\"max_concurrency\": 5})\n",
    "\n",
    "# Let's inspect the first summary\n",
    "print(summaries[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a4b162dd",
   "metadata": {},
   "source": [
    "Now comes the crucial part. We need a `MultiVectorRetriever` which requires two main components:\n",
    "\n",
    "1. A `vectorstore` to store the embeddings of our summaries.\n",
    "2. A `docstore` (a simple key-value store) to hold the original, full documents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "id": "1f92ed7f",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.storage import InMemoryByteStore\n",
    "from langchain.retrievers.multi_vector import MultiVectorRetriever\n",
    "from langchain_core.documents import Document\n",
    "\n",
    "# The vectorstore to index the summary embeddings\n",
    "vectorstore = Chroma(collection_name=\"summaries\", embedding_function=OpenAIEmbeddings())\n",
    "\n",
    "# The storage layer for the parent documents\n",
    "store = InMemoryByteStore()\n",
    "id_key = \"doc_id\" # This key will link summaries to their parent documents\n",
    "\n",
    "# The retriever that orchestrates the whole process\n",
    "retriever = MultiVectorRetriever(\n",
    "    vectorstore=vectorstore,\n",
    "    byte_store=store,\n",
    "    id_key=id_key,\n",
    ")\n",
    "\n",
    "# Generate unique IDs for each of our original documents\n",
    "doc_ids = [str(uuid.uuid4()) for _ in docs]\n",
    "\n",
    "# Create new Document objects for the summaries, adding the 'doc_id' to their metadata\n",
    "summary_docs = [\n",
    "    Document(page_content=s, metadata={id_key: doc_ids[i]})\n",
    "    for i, s in enumerate(summaries)\n",
    "]\n",
    "\n",
    "# Add the summaries to the vectorstore\n",
    "retriever.vectorstore.add_documents(summary_docs)\n",
    "\n",
    "# Add the original documents to the docstore, linking them by the same IDs\n",
    "retriever.docstore.mset(list(zip(doc_ids, docs)))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9cc87a19",
   "metadata": {},
   "source": [
    "Our advanced index is now built. Let’s test the retrieval process. We’ll ask a question about “Memory in agents” and see what happens."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "id": "d3977efc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--- Result from searching summaries ---\n",
      "The document discusses the concept of building autonomous agents powered by Large Language Models (LLMs) as their core controllers. It covers components such as planning, memory, and tool use, along with case studies and proof-of-concept examples like AutoGPT and GPT-Engineer. Challenges like finite context length, planning difficulties, and reliability of natural language interfaces are also highlighted. The document provides references to related research papers and offers a comprehensive overview of LLM-powered autonomous agents.\n",
      "\n",
      "--- Metadata showing the link to the parent document ---\n",
      "{'doc_id': '4b5c6d7e-8f9a-0b1c-2d3e-4f5a6b7c8d9e'}\n"
     ]
    }
   ],
   "source": [
    "query = \"Memory in agents\"\n",
    "\n",
    "# First, let's see what the vectorstore finds by searching the summaries\n",
    "sub_docs = vectorstore.similarity_search(query, k=1)\n",
    "print(\"--- Result from searching summaries ---\")\n",
    "print(sub_docs[0].page_content)\n",
    "print(\"\\n--- Metadata showing the link to the parent document ---\")\n",
    "print(sub_docs[0].metadata)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "133112e9",
   "metadata": {},
   "source": [
    "As you can see, the search found the summary that mentions “memory.” Now, the `MultiVectorRetriever` will use the `doc_id` from this summary's metadata to automatically fetch the full parent document from the `docstore`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "id": "bceb39c7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "--- The full document retrieved by the MultiVectorRetriever ---\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "LLM Powered Autonomous Agents | Lil'Log\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "Lil'Log\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "Posts\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "Archive\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "Search\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "Tags\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "FAQ\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "emojisearch.app\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "      LLM Powered Autonomous Agents\n",
      "    \n",
      "Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n",
      "\n",
      "\n",
      " \n",
      "\n",
      "\n",
      "Table of Contents\n",
      "\n",
      "\n",
      "\n",
      "Agent System Overview\n",
      "\n",
      "Component One: Planning\n",
      "\n",
      "Task Decomposition\n",
      "\n",
      "Self-Reflection\n",
      "\n",
      "\n",
      "Component Two: Memory\n",
      "\n",
      "Types of Memory\n",
      "\n",
      "Maximum Inner Product Search (MIPS)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Let the full retriever do its job\n",
    "retrieved_docs = retriever.get_relevant_documents(query, n_results=1)\n",
    "\n",
    "# Print the beginning of the retrieved full document\n",
    "print(\"\\n--- The full document retrieved by the MultiVectorRetriever ---\")\n",
    "print(retrieved_docs[0].page_content[0:500])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0065b900",
   "metadata": {},
   "source": [
    "This is exactly what we wanted! We searched over concise summaries but got back the complete, context-rich document, solving the chunk size dilemma."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4c3c46b4",
   "metadata": {},
   "source": [
    "<a id='part4-2'></a>\n",
    "## Hierarchical Indexing (RAPTOR) Knowledge Tree\n",
    "\n",
    "**The Theory:** RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval) takes the multi-representation idea a step further. Instead of just one layer of summaries, RAPTOR builds a multi-level tree of summaries. It starts by clustering small document chunks. It then summarizes each cluster.\n",
    "\n",
    "![RAPTOR (from LangChain Docs)](https://miro.medium.com/v2/resize:fit:875/1*95v0K13O2rvsAYJ96ldhew.png)\n",
    "\n",
    "Then, it takes these summaries, clusters *them*, and summarizes the new clusters. This process repeats, creating a hierarchy of knowledge from fine-grained details to high-level concepts. When you query, you can search at different levels of this tree, allowing for retrieval that can be as specific or as general as needed.\n",
    "\n",
    "This is a more advanced technique, and while we won’t implement the full algorithm here, you can find a deep dive and complete code in the [RAPTOR Cookbook](https://github.com/langchain-ai/langchain/blob/master/cookbook/RAPTOR.ipynb). It represents the cutting edge of structured indexing."
   ]
  },
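  {
   "cell_type": "markdown",
   "id": "5d2b8f13",
   "metadata": {},
   "source": [
    "To make the recursion concrete, here is a toy sketch of the cluster-then-summarize loop. It is not the RAPTOR implementation: the real method clusters chunk embeddings and summarizes each cluster with an LLM, whereas the hypothetical `pair_up` and `join_summary` functions below are simple stand-ins so the example runs without any model. Only the tree-building structure is the point.\n",
    "\n",
    "```python\n",
    "def build_summary_tree(chunks, cluster, summarize, max_levels=3):\n",
    "    \"\"\"Return a list of levels: level 0 holds the raw chunks, each later level summarizes the one below.\"\"\"\n",
    "    levels = [list(chunks)]\n",
    "    # Stop once a single root summary remains or the level cap is reached\n",
    "    while len(levels[-1]) > 1 and len(levels) <= max_levels:\n",
    "        groups = cluster(levels[-1])\n",
    "        levels.append([summarize(group) for group in groups])\n",
    "    return levels\n",
    "\n",
    "# Stand-in for embedding-based clustering: group neighbours in pairs\n",
    "def pair_up(texts):\n",
    "    return [texts[i:i + 2] for i in range(0, len(texts), 2)]\n",
    "\n",
    "# Stand-in for an LLM summary: just record what was merged\n",
    "def join_summary(group):\n",
    "    return \"summary(\" + \" + \".join(group) + \")\"\n",
    "\n",
    "tree = build_summary_tree([\"c1\", \"c2\", \"c3\", \"c4\"], pair_up, join_summary)\n",
    "for depth, level in enumerate(tree):\n",
    "    print(depth, level)\n",
    "```\n",
    "\n",
    "Every node in every level can then be embedded and indexed, so a query can match either a detailed leaf chunk or a higher-level summary."
   ]
  },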
  {
   "cell_type": "markdown",
   "id": "bba3ef68",
   "metadata": {},
   "source": [
    "<a id='part4-3'></a>\n",
    "## Token-Level Precision (ColBERT)\n",
    "\n",
    "**The Theory:** Standard embedding models create a single vector for an entire chunk of text (this is called a “bag-of-words” approach). This can lose a lot of nuance.\n",
    "\n",
    "![Specialized embeddings (Created by Fareed Khan)](https://miro.medium.com/v2/resize:fit:875/1*VL6Ny9Z8S9kRqgYyFhhdsA.png)\n",
    "\n",
    "> **ColBERT (Contextualized Late Interaction over BERT)** offers a more granular approach. It generates a separate, context-aware embedding for *every single token* in the document.\n",
    "\n",
    "When you make a query, ColBERT also embeds every token in your query. Then, instead of comparing one document vector to one query vector, it finds the maximum similarity between each query token and *any* document token.\n",
    "\n",
    "This “late interaction” allows for a much finer-grained understanding of relevance, excelling at keyword-style searches.\n",
    "\n",
    "We can easily use ColBERT through the `RAGatouille` library."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "id": "8ed72249",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: ragatouille in /path/to/your/env/lib/python3.9/site-packages (0.0.7)\n",
      "Successfully installed ragatouille\n"
     ]
    }
   ],
   "source": [
    "# Install the required library\n",
    "!pip install -U ragatouille"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "id": "3ca89eef",
   "metadata": {},
   "outputs": [],
   "source": [
    "from ragatouille import RAGPretrainedModel\n",
    "\n",
    "# Load a pre-trained ColBERT model\n",
    "RAG = RAGPretrainedModel.from_pretrained(\"colbert-ir/colbertv2.0\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5eded14a",
   "metadata": {},
   "source": [
    "Now, let’s index a Wikipedia page using ColBERT’s unique token-level approach."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "id": "cf3db65e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "\n",
    "def get_wikipedia_page(title: str):\n",
    "    \"\"\"A helper function to retrieve content from Wikipedia.\"\"\"\n",
    "    # Wikipedia API endpoint and parameters\n",
    "    URL = \"https://en.wikipedia.org/w/api.php\"\n",
    "    params = { \"action\": \"query\", \"format\": \"json\", \"titles\": title, \"prop\": \"extracts\", \"explaintext\": True }\n",
    "    headers = {\"User-Agent\": \"MyRAGApp/1.0\"}\n",
    "    response = requests.get(URL, params=params, headers=headers)\n",
    "    data = response.json()\n",
    "    page = next(iter(data[\"query\"][\"pages\"].values()))\n",
    "    return page.get(\"extract\")\n",
    "\n",
    "full_document = get_wikipedia_page(\"Hayao_Miyazaki\")\n",
    "\n",
    "# Index the document with RAGatouille. It handles the chunking and token-level embedding internally.\n",
    "RAG.index(\n",
    "    collection=[full_document],\n",
    "    index_name=\"Miyazaki-ColBERT\",\n",
    "    max_document_length=180,\n",
    "    split_documents=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "319bcc95",
   "metadata": {},
   "source": [
    "The indexing process is more complex, as it’s creating embeddings for every token, but `RAGatouille` handles it seamlessly. Now, let's search our new index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "id": "a3aae114",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[{'content': 'In April 1984, Miyazaki opened his own office in Suginami Ward, naming it Nibariki.\\n\\n\\n=== Studio Ghibli ===\\n\\n\\n==== Early films (1985–1996) ====\\nIn June 1985, Miyazaki, Takahata, Tokuma and Suzuki founded the animation production company Studio Ghibli, with funding from Tokuma Shoten. Studio Ghibli\\'s first film, Laputa: Castle in the Sky (1986)...', 'score': 25.9036, 'rank': 1, 'document_id': '...', 'passage_id': 28}, \n",
      " {'content': 'Hayao Miyazaki (...) is a Japanese animator, filmmaker, and manga artist. A co-founder of Studio Ghibli, he has attained international acclaim as a masterful storyteller...', 'score': 25.5716, 'rank': 2, 'document_id': '...', 'passage_id': 0},\n",
      " {'content': 'Glen Keane said Miyazaki is a \"huge influence\" on Walt Disney Animation Studios and has been \"part of our heritage\" ever since The Rescuers Down Under (1990). The Disney Renaissance era was also prompted by competition with the development of Miyazaki\\'s films...', 'score': 24.8411, 'rank': 3, 'document_id': '...', 'passage_id': 76}]"
     ]
    }
   ],
   "source": [
    "# Search the ColBERT index\n",
    "results = RAG.search(query=\"What animation studio did Miyazaki found?\", k=3)\n",
    "print(results)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "be40b29d",
   "metadata": {},
   "source": [
    "The top result directly mentions the founding of Studio Ghibli. We can also easily wrap this as a standard LangChain retriever."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "id": "1f20471b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "In April 1984, Miyazaki opened his own office in Suginami Ward, naming it Nibariki.\n",
      "\n",
      "\n",
      "=== Studio Ghibli ===\n",
      "\n",
      "\n",
      "==== Early films (1985–1996) ====\n",
      "In June 1985, Miyazaki, Takahata, Tokuma and Suzuki founded the animation production company Studio Ghibli, with funding from Tokuma Shoten. Studio Ghibli's first film, Laputa: Castle in the Sky (1986), employed the same production crew of Nausicaä. Miyazaki's designs for the film's setting were inspired by Greek architecture and \"European urbanistic templates\"."
     ]
    }
   ],
   "source": [
    "# Convert the RAGatouille model into a LangChain-compatible retriever\n",
    "colbert_retriever = RAG.as_langchain_retriever(k=3)\n",
    "\n",
    "# Use it like any other retriever\n",
    "retrieved_docs = colbert_retriever.invoke(\"What animation studio did Miyazaki found?\")\n",
    "print(retrieved_docs[0].page_content)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e6d9509e",
   "metadata": {},
   "source": [
    "ColBERT provides a powerful, fine-grained alternative to traditional vector search, demonstrating that the way we build our library is just as important as how we search it."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d4e02607",
   "metadata": {},
   "source": [
    "<a id='part5'></a>\n",
    "# Advanced Retrieval & Generation\n",
    "\n",
    "We have created a sophisticated RAG system with intelligent routing and advanced indexing. Now, we’ve reached the final mile: retrieval and generation. This is where we ensure the context we feed to the LLM is of the highest possible quality and that the LLM’s final answer is relevant, accurate, and grounded in that context.\n",
    "\n",
    "![Retrieval/Generation (Created by Fareed Khan)](https://miro.medium.com/v2/resize:fit:1250/1*RJzBqSbw8V0LPpzYN7VFjA.png)\n",
    "\n",
    "Even with the best indexing, our initial retrieval can still contain noise less relevant documents that slip through. And LLMs, powerful as they are, can sometimes misunderstand context or hallucinate.\n",
    "\n",
    "This section introduces the advanced techniques that act as the final quality control layer for our pipeline."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d07893be",
   "metadata": {},
   "source": [
    "<a id='part5-1'></a>\n",
    "## Dedicated Re-ranking\n",
    "\n",
    "Standard retrieval methods give us a ranked list of documents, but this initial ranking isn’t always perfect. **Re-ranking** is a crucial second-pass step where we take the initial set of retrieved documents and use a more sophisticated (and often more expensive) model to re-order them based on their relevance to the query.\n",
    "\n",
    "![Dedicated Re-Ranking (Created by Fareed Khan)](https://miro.medium.com/v2/resize:fit:1250/1*rnQCpniADswmhbTFiCN1Gg.png)\n",
    "\n",
    "> This ensures that the most relevant documents are placed at the very top of the context we provide to the LLM.\n",
    "\n",
    "We have already seen one powerful re-ranking method: Reciprocal Rank Fusion (RRF) in our RAG-Fusion section. It’s a great, model-free way to combine results. But for an even more powerful approach, we can use a dedicated re-ranking model, like the one provided by Cohere.\n",
    "\n",
    "Let’s set up a standard retriever first. We’ll use the same blog post from our previous examples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "id": "9c3c91e3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# You will need to set your COHERE_API_KEY environment variable\n",
    "# os.environ['COHERE_API_KEY'] = '<your-cohere-api-key>'\n",
    "\n",
    "# Load, split, and index the document\n",
    "loader = WebBaseLoader(web_paths=(\"https://lilianweng.github.io/posts/2023-06-23-agent/\",))\n",
    "blog_docs = loader.load()\n",
    "text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=300, chunk_overlap=50)\n",
    "splits = text_splitter.split_documents(blog_docs)\n",
    "vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())\n",
    "\n",
    "# First-pass retriever: get the top 10 potentially relevant documents\n",
    "retriever = vectorstore.as_retriever(search_kwargs={\"k\": 10})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e819d9f8",
   "metadata": {},
   "source": [
    "Now, we introduce the `ContextualCompressionRetriever`. This special retriever wraps our base retriever and adds a \"compressor\" step. Here, our compressor will be the `CohereRerank` model.\n",
    "\n",
    "It will take the 10 documents from our base retriever and re-order them, returning only the most relevant ones."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "id": "5b81e83b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--- Re-ranked and Compressed Documents ---\n",
      "Relevance Score: 0.9982\n",
      "Content: Task decomposition can be done (1) by LLM with simple prompting like \"Steps for XYZ.\", \"What are the subgoals for achieving XYZ?\", (2) by using task...\n",
      "\n",
      "Relevance Score: 0.9851\n",
      "Content: Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into mult...\n",
      "\n",
      "Relevance Score: 0.9765\n",
      "Content: LLM-powered autonomous agents have been an exciting concept. They can be used for task decomposition by prompting, using task-specific instructions, or ...\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# You will need to install cohere: pip install cohere\n",
    "from langchain.retrievers import ContextualCompressionRetriever\n",
    "from langchain.retrievers.document_compressors import CohereRerank\n",
    "\n",
    "# Initialize the Cohere Rerank model\n",
    "compressor = CohereRerank()\n",
    "\n",
    "# Create the compression retriever\n",
    "compression_retriever = ContextualCompressionRetriever(\n",
    "    base_compressor=compressor, \n",
    "    base_retriever=retriever\n",
    ")\n",
    "\n",
    "# Let's test it with our query\n",
    "question = \"What is task decomposition for LLM agents?\"\n",
    "compressed_docs = compression_retriever.get_relevant_documents(question)\n",
    "\n",
    "# Print the re-ranked documents\n",
    "print(\"--- Re-ranked and Compressed Documents ---\")\n",
    "for doc in compressed_docs:\n",
    "    print(f\"Relevance Score: {doc.metadata['relevance_score']:.4f}\")\n",
    "    print(f\"Content: {doc.page_content[:150]}...\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "70819490",
   "metadata": {},
   "source": [
    "The output is remarkable. The `CohereRerank` model has not only re-ordered the documents but has also assigned a `relevance_score` to each one. We can now be much more confident that the context we pass to the LLM is of the highest quality, directly leading to better, more accurate answers."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3c4a740",
   "metadata": {},
   "source": [
    "<a id='part5-2'></a>\n",
    "## Self-Correction using AI Agents\n",
    "\n",
    "What if our RAG system could check its own work before giving an answer? That’s the idea behind self-correcting RAG architectures like **CRAG (Corrective RAG)** and **Self-RAG**.\n",
    "\n",
    "![Self Correction RAG (From Langchain blog)](https://miro.medium.com/v2/resize:fit:875/1*LpQrsvNj09aJPMhhh4fc-A.png)\n",
    "\n",
    "These aren’t just simple chains, they are dynamic graphs (often built with LangGraph) that can reason about the quality of retrieved information and decide on a course of action.\n",
    "\n",
    "- **CRAG:** If the retrieved documents are irrelevant or ambiguous for a given query, a CRAG system won’t just pass them to the LLM. Instead, it triggers a new, more robust web search to find better information, corrects the retrieved documents, and then proceeds with generation.\n",
    "- **Self-RAG:** This approach takes it a step further. At each step, it uses an LLM to generate “reflection tokens” that critique the process. It grades the retrieved documents for relevance. If they’re not relevant, it retrieves again. Once it has good documents, it generates an answer and then grades that answer for factual consistency, ensuring it’s grounded in the source documents.\n",
    "\n",
    "These techniques represent the state-of-the-art in building reliable, production-grade RAG. Implementing them from scratch involves building a state machine or graph. While the full implementation is extensive, you can find excellent, detailed walkthroughs here:\n",
    "\n",
    "- [CRAG Notebook](https://github.com/langchain-ai/langgraph/blob/main/examples/rag/langgraph_crag.ipynb)\n",
    "- [Self-RAG Notebook](https://github.com/langchain-ai/langgraph/blob/main/examples/rag/langgraph_self_rag_mistral_nomic.ipynb)\n",
    "\n",
    "These agentic frameworks are the key to moving beyond simple Q&A bots to creating truly robust reasoning engines."
   ]
  },
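  {
   "cell_type": "markdown",
   "id": "9a60e7c4",
   "metadata": {},
   "source": [
    "Before opening those notebooks, it can help to see the corrective control flow in isolation. The sketch below is a simplified, CRAG-style loop with hypothetical stand-in components passed in as plain functions: retrieve, grade each document, fall back to a web search when nothing relevant survives, then generate. It leaves out the “ambiguous” branch and the answer-grading step of Self-RAG, and none of the function names come from LangGraph.\n",
    "\n",
    "```python\n",
    "def corrective_rag(question, retrieve, grade, web_search, generate):\n",
    "    \"\"\"Retrieve, keep only documents graded relevant, and fall back to web search if none survive.\"\"\"\n",
    "    docs = retrieve(question)\n",
    "    relevant = [doc for doc in docs if grade(question, doc)]\n",
    "    if not relevant:\n",
    "        # Corrective step: retrieval failed, so look elsewhere\n",
    "        relevant = web_search(question)\n",
    "    return generate(question, relevant)\n",
    "\n",
    "# Stand-in components (a real system would call a retriever, an LLM grader, and a search tool)\n",
    "def retrieve(question):\n",
    "    return [\"Notes on cooking pasta\", \"Gardening tips\"]\n",
    "\n",
    "def grade(question, doc):\n",
    "    # Toy relevance check: does the document share any word with the question?\n",
    "    return any(word in doc.lower() for word in question.lower().split())\n",
    "\n",
    "def web_search(question):\n",
    "    return [f\"Web result about: {question}\"]\n",
    "\n",
    "def generate(question, docs):\n",
    "    return f\"Answer based on {len(docs)} document(s): {docs[0]}\"\n",
    "\n",
    "print(corrective_rag(\"agent memory\", retrieve, grade, web_search, generate))\n",
    "```\n",
    "\n",
    "In a graph framework, each of these functions becomes a node and the `if not relevant` check becomes a conditional edge."
   ]
  },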
  {
   "cell_type": "markdown",
   "id": "fb8f1480",
   "metadata": {},
   "source": [
    "<a id='part5-3'></a>\n",
    "## Impact of Long Context\n",
    "\n",
    "A recurring theme in RAG has been the limited context windows of LLMs. But with the rise of models boasting massive context windows (128k, 200k, or even 1 million tokens), a question arises:\n",
    "\n",
    "![Long Context (Created by Fareed Khan)](https://miro.medium.com/v2/resize:fit:6986/1*g3NCw9EzZcylHpOJMlGr8A.png)\n",
    "\n",
    "> **Do we still need RAG?** Can we just stuff all our documents into the prompt?\n",
    "\n",
    "The answer is nuanced. While long context models are incredibly powerful, they are not a silver bullet.\n",
    "\n",
    "Research has shown that their performance can degrade when the crucial information is buried in the middle of a very long context (the “needle in a haystack” problem).\n",
    "\n",
    "- **RAG Advantage:** RAG excels at *finding* the needle first and presenting only that to the LLM. It’s a precision tool.\n",
    "- **Long Context’s Advantage:** Long context models are fantastic for tasks that require synthesizing information from *many different parts* of a document simultaneously, something RAG might miss.\n",
    "\n",
    "The future is likely a hybrid approach: using RAG to perform an initial, precise retrieval of the most relevant documents and then feeding this high-quality, pre-filtered context into a long-context model for final synthesis.\n",
    "\n",
    "For a deep dive into this topic, this presentation is an excellent resource:\n",
    "\n",
    "- **Slides on Long Context:** [The Impact of Long Context on RAG](https://docs.google.com/presentation/d/1mJUiPBdtf58NfuSEQ7pVSEQ2Oqmek7F1i4gBwR6JDss/edit#slide=id.g26c0cb8dc66_0_0)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c8ad5330",
   "metadata": {},
   "source": [
    "<a id='part6'></a>\n",
    "# Manual RAG Evaluation\n",
    "\n",
    "We have built an increasingly sophisticated RAG pipeline, layering on advanced techniques for retrieval, indexing, and generation. But a crucial question remains: **how do we prove it actually works?**\n",
    "\n",
    "In a production environment, “it seems to work” is not enough. We need objective, repeatable metrics to measure performance, identify weaknesses, and guide improvements.\n",
    "\n",
    "This is where evaluation comes in. It’s the science of holding our RAG system accountable. In this part, we will explore how to quantitatively measure our system’s quality by building our own evaluators from first principles."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "24f8831b",
   "metadata": {},
   "source": [
    "<a id='part6-1'></a>\n",
    "## The Core Metrics: What Should We Measure?\n",
    "\n",
    "Before we dive into code, let’s define what a “good” RAG response looks like. We can break it down into a few core principles:\n",
    "\n",
    "1. **Faithfulness:** Does the answer stick strictly to the provided context? A faithful answer does not invent information or use the LLM’s pre-trained knowledge to answer. This is the single most important metric for preventing hallucinations.\n",
    "2. **Correctness:** Is the answer factually correct when compared to a “ground truth” or reference answer?\n",
    "3. **Contextual Relevancy:** Was the context we retrieved actually relevant to the user’s question? This evaluates the performance of our retriever, not the generator.\n",
    "\n",
    "Let’s explore how to measure these, starting with the most transparent method: building the evaluators ourselves."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "299025d5",
   "metadata": {},
   "source": [
    "<a id='part6-2'></a>\n",
    "## Building Evaluators from Scratch with LangChain\n",
    "\n",
    "The best way to understand evaluation is to build it. Using basic LangChain components, we can create custom chains that instruct an LLM to act as an impartial “judge”, grading our RAG system’s output based on criteria we define in a prompt. This gives us maximum control and transparency.\n",
    "\n",
    "Let’s begin with **Correctness**. Our goal is to create a chain that compares the generated_answer to a ground_truth answer and returns a score from 0 to 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "id": "954fafa5",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.prompts import PromptTemplate\n",
    "from langchain_core.pydantic_v1 import BaseModel, Field\n",
    "from langchain_openai import ChatOpenAI\n",
    "\n",
    "# We'll use a powerful LLM like gpt-4o to act as our \"judge\" for reliable evaluation.\n",
    "llm = ChatOpenAI(temperature=0, model_name=\"gpt-4o\", max_tokens=4000)\n",
    "\n",
    "# Define the output schema for our evaluation score to ensure consistent, structured output.\n",
    "class ResultScore(BaseModel):\n",
    "    score: float = Field(..., description=\"The score of the result, ranging from 0 to 1 where 1 is the best possible score.\")\n",
    "\n",
    "# This prompt template clearly instructs the LLM on how to score the answer's correctness.\n",
    "correctness_prompt = PromptTemplate(\n",
    "    input_variables=[\"question\", \"ground_truth\", \"generated_answer\"],\n",
    "    template=\"\"\"\n",
    "    Question: {question}\n",
    "    Ground Truth: {ground_truth}\n",
    "    Generated Answer: {generated_answer}\n",
    "\n",
    "    Evaluate the correctness of the generated answer compared to the ground truth.\n",
    "    Score from 0 to 1, where 1 is perfectly correct and 0 is completely incorrect.\n",
    "    \n",
    "    Score:\n",
    "    \"\"\"\n",
    ")\n",
    "\n",
    "# We build the evaluation chain by piping the prompt to the LLM with structured output.\n",
    "correctness_chain = correctness_prompt | llm.with_structured_output(ResultScore)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "924c519a",
   "metadata": {},
   "source": [
    "Now, let’s wrap this in a simple function and test it. What if the ground truth is “Paris and Madrid” but our RAG system only partially answered with “Paris”?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "id": "d5130ff7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Correctness Score: 0.5\n"
     ]
    }
   ],
   "source": [
    "def evaluate_correctness(question, ground_truth, generated_answer):\n",
    "    \"\"\"A helper function to run our custom correctness evaluation chain.\"\"\"\n",
    "    result = correctness_chain.invoke({\n",
    "        \"question\": question, \n",
    "        \"ground_truth\": ground_truth, \n",
    "        \"generated_answer\": generated_answer\n",
    "    })\n",
    "    return result.score\n",
    "\n",
    "# Test the correctness chain with a partially correct answer.\n",
    "question = \"What is the capital of France and Spain?\"\n",
    "ground_truth = \"Paris and Madrid\"\n",
    "generated_answer = \"Paris\"\n",
    "score = evaluate_correctness(question, ground_truth, generated_answer)\n",
    "\n",
    "print(f\"Correctness Score: {score}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fc478a1a",
   "metadata": {},
   "source": [
    "This is a perfect result. Our judge LLM correctly reasoned that the generated answer was only half-correct and assigned an appropriate score of 0.5.\n",
    "\n",
    "Next, let’s build an evaluator for **Faithfulness**. This is arguably more important than correctness for RAG, as it’s our primary defense against hallucination.\n",
    "\n",
    "Here, the judge LLM must ignore whether the answer is factually correct and *only* care if the answer can be derived from the given `context`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "id": "8539b1e8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# The prompt template for faithfulness includes several examples (few-shot prompting)\n",
    "# to make the instructions to the judge LLM crystal clear.\n",
    "faithfulness_prompt = PromptTemplate(\n",
    "    input_variables=[\"question\",\"context\", \"generated_answer\"],\n",
    "    template=\"\"\"\n",
    "    Question: {question}\n",
    "    Context: {context}\n",
    "    Generated Answer: {generated_answer}\n",
    "\n",
    "    Evaluate if the generated answer to the question can be deduced from the context.\n",
    "    Score of 0 or 1, where 1 is perfectly faithful *AND CAN BE DERIVED FROM THE CONTEXT* and 0 otherwise.\n",
    "    You don't mind if the answer is correct; all you care about is if the answer can be deduced from the context.\n",
    "    \n",
    "    Example:\n",
    "    Question: What is the capital of France and Spain?\n",
    "    Context: Paris is the capital of France and Madrid is the capital of Spain.\n",
    "    Generated Answer: Paris\n",
    "    in this case the generated answer is faithful to the context so the score should be *1*.\n",
    "    \n",
    "    Example:\n",
    "    Question: What is 2+2?\n",
    "    Context: 4.\n",
    "    Generated Answer: 4.\n",
    "    In this case, the context states '4', but it does not provide information to deduce the answer to 'What is 2+2?', so the score should be 0.\n",
    "    \"\"\"\n",
    ")\n",
    "\n",
    "# Build the faithfulness chain using the same structured LLM.\n",
    "faithfulness_chain = faithfulness_prompt | llm.with_structured_output(ResultScore)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4afe1531",
   "metadata": {},
   "source": [
    "We’ve provided several examples in the prompt to guide the LLM’s reasoning, especially for tricky edge cases. Let’s test it with the “2+2” example, which is a classic test for faithfulness."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "id": "853cc9ac",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Faithfulness Score: 0.0\n"
     ]
    }
   ],
   "source": [
    "def evaluate_faithfulness(question, context, generated_answer):\n",
    "    \"\"\"A helper function to run our custom faithfulness evaluation chain.\"\"\"\n",
    "    result = faithfulness_chain.invoke({\n",
    "        \"question\": question, \n",
    "        \"context\": context, \n",
    "        \"generated_answer\": generated_answer\n",
    "    })\n",
    "    return result.score\n",
    "\n",
    "# Test the faithfulness chain. The answer is correct, but is it faithful?\n",
    "question = \"what is 3+3?\"\n",
    "context = \"6\"\n",
    "generated_answer = \"6\"\n",
    "score = evaluate_faithfulness(question, context, generated_answer)\n",
    "\n",
    "print(f\"Faithfulness Score: {score}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f3fe1c60",
   "metadata": {},
   "source": [
    "This demonstrates the power and precision of a well-defined faithfulness metric. Even though the answer **6** is factually correct, it could not be logically deduced from the provided context “6”.\n",
    "\n",
    "The context didn’t say **3+3 equals 6**. Our system correctly flagged this as an unfaithful answer, which is likely a hallucination where the LLM used its own pre-trained knowledge instead of the provided context.\n",
    "\n",
    "Building these evaluators from scratch provides deep insight into what we’re measuring. However, it can be time-consuming. In the next part, we’ll see how to achieve the same results more efficiently using specialized evaluation frameworks."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eba6c404",
   "metadata": {},
   "source": [
    "<a id='part7'></a>\n",
    "# Evaluation with Frameworks\n",
    "\n",
    "In the previous part, we built our own evaluation chains from scratch. This is a fantastic way to understand the core principles of RAG metrics.\n",
    "\n",
    "> However, for faster and more robust testing, dedicated evaluation frameworks are the way to go.\n",
    "\n",
    "![Eval using Frameworks (Created by Fareed Khan)](https://miro.medium.com/v2/resize:fit:1250/1*uBn-2vN1Bz--NXfaeR2hyw.png)\n",
    "\n",
    "These libraries provide pre-built, fine-tuned metrics that handle the complexity of evaluation for us, allowing us to focus on analyzing the results.\n",
    "\n",
    "We’ll explore three popular frameworks: `deepeval`, `grouse`, and the RAG-specific powerhouse, `RAGAS`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8bf89250",
   "metadata": {},
   "source": [
    "<a id='part7-1'></a>\n",
    "## Rapid Evaluation with `deepeval`\n",
    "\n",
    "`deepeval` is a powerful, open-source framework designed to make LLM evaluation simple and intuitive. It provides a set of well-defined metrics that can be easily applied to your RAG pipeline's outputs.\n",
    "\n",
    "The workflow involves creating `LLMTestCase` objects and measuring them against pre-built metrics like `Correctness`, `Faithfulness`, and `ContextualRelevancy`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "id": "3ed433dc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "✨ Evaluation Results ✨\n",
      "-------------------------\n",
      "Overall Score: 0.50\n",
      "-------------------------\n",
      "Metrics Summary:\n",
      "- Correctness: 1.00\n",
      "- Faithfulness: 0.00\n",
      "-------------------------\n"
     ]
    }
   ],
   "source": [
    "# You will need to install deepeval: pip install deepeval\n",
    "from deepeval import evaluate\n",
    "from deepeval.metrics import GEval, FaithfulnessMetric, ContextualRelevancyMetric\n",
    "from deepeval.test_case import LLMTestCase\n",
    "\n",
    "# Create test cases\n",
    "test_case_correctness = LLMTestCase(\n",
    "    input=\"What is the capital of Spain?\",\n",
    "    expected_output=\"Madrid is the capital of Spain.\",\n",
    "    actual_output=\"MadriD.\"\n",
    ")\n",
    "\n",
    "test_case_faithfulness = LLMTestCase(\n",
    "    input=\"what is 3+3?\",\n",
    "    actual_output=\"6\",\n",
    "    retrieval_context=[\"6\"]\n",
    ")\n",
    "\n",
    "# The evaluate() function runs all test cases against all specified metrics\n",
    "evaluation_results = evaluate(\n",
    "    test_cases=[test_case_correctness, test_case_faithfulness],\n",
    "    metrics=[GEval(name=\"Correctness\", model=\"gpt-4o\"), FaithfulnessMetric()]\n",
    ")\n",
    "\n",
    "print(evaluation_results)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d265b7fb",
   "metadata": {},
   "source": [
    "The aggregated view from `deepeval` immediately gives us a high-level picture of our system's performance, making it easy to spot areas that need improvement."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b9be6a07",
   "metadata": {},
   "source": [
    "<a id='part7-2'></a>\n",
    "## Another Powerful Alternative with `grouse`\n",
    "\n",
    "`grouse` is another excellent open-source option, offering a similar suite of metrics but with a unique focus on allowing deep customization of the \"judge\" prompts. This is useful for fine-tuning evaluation criteria for a specific domain."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "id": "a31f4906",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Grouse Faithfulness Score (0 or 1): 0\n"
     ]
    }
   ],
   "source": [
    "# You will need to install grouse: pip install grouse-eval\n",
    "from grouse import EvaluationSample, GroundedQAEvaluator\n",
    "\n",
    "evaluator = GroundedQAEvaluator()\n",
    "unfaithful_sample = EvaluationSample(\n",
    "    input=\"Where is the Eiffel Tower located?\",\n",
    "    actual_output=\"The Eiffel Tower is located at Rue Rabelais in Paris.\",\n",
    "    references=[\n",
    "        \"The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France\",\n",
    "        \"Gustave Eiffel died in his appartment at Rue Rabelais in Paris.\"\n",
    "    ]\n",
    ")\n",
    "\n",
    "result = evaluator.evaluate(eval_samples=[unfaithful_sample]).evaluations[0]\n",
    "print(f\"Grouse Faithfulness Score (0 or 1): {result.faithfulness.faithfulness}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "008fa687",
   "metadata": {},
   "source": [
    "Like `deepeval`, `grouse` effectively catches subtle errors, providing another robust tool for our evaluation toolkit."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "46f03313",
   "metadata": {},
   "source": [
    "<a id='part7-3'></a>\n",
    "## Evaluation with `RAGAS`\n",
    "\n",
    "While `deepeval` and `grouse` are great general-purpose evaluators, **RAGAS (Retrieval-Augmented Generation Assessment)** is a framework built *specifically* for evaluating RAG pipelines. It provides a comprehensive suite of metrics that measure every component of your system, from retriever to generator.\n",
    "\n",
    "[Image of the RAGAS logo with its core metrics: Faithfulness, Answer Relevancy, Context Recall, etc.]\n",
    "\n",
    "To use `RAGAS`, we first need to prepare our evaluation data in a specific format. It requires four key pieces of information for each test case:\n",
    "\n",
    "- `question`: The user's input query.\n",
    "- `answer`: The final answer generated by our RAG system.\n",
    "- `contexts`: The list of documents retrieved by our retriever.\n",
    "- `ground_truth`: The correct, reference answer.\n",
    "\n",
    "Let’s prepare a sample dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "id": "aca0e5d6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 1. Prepare the evaluation data\n",
    "questions = [\n",
    "    \"What is the name of the three-headed dog guarding the Sorcerer's Stone?\",\n",
    "    \"Who gave Harry Potter his first broomstick?\",\n",
    "    \"Which house did the Sorting Hat initially consider for Harry?\",\n",
    "]\n",
    "\n",
    "# These would be the answers generated by our RAG pipeline\n",
    "generated_answers = [\n",
    "    \"The three-headed dog is named Fluffy.\",\n",
    "    \"Professor McGonagall gave Harry his first broomstick, a Nimbus 2000.\",\n",
    "    \"The Sorting Hat strongly considered putting Harry in Slytherin.\",\n",
    "]\n",
    "\n",
    "# The ground truth, or \"perfect\" answers\n",
    "ground_truth_answers = [\n",
    "    \"Fluffy\",\n",
    "    \"Professor McGonagall\",\n",
    "    \"Slytherin\",\n",
    "]\n",
    "\n",
    "# The context retrieved by our RAG system for each question\n",
    "retrieved_documents = [\n",
    "    [\"A massive, three-headed dog was guarding a trapdoor. Hagrid mentioned its name was Fluffy.\"],\n",
    "    [\"First years are not allowed brooms, but Professor McGonagall, head of Gryffindor, made an exception for Harry.\"],\n",
    "    [\"The Sorting Hat muttered in Harry's ear, 'You could be great, you know, it's all here in your head, and Slytherin will help you on the way to greatness...'\"],\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "43130601",
   "metadata": {},
   "source": [
    "Next, we structure this data using the Hugging Face `datasets` library, which `RAGAS` integrates with seamlessly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "id": "2c6c440d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# You will need to install ragas and datasets: pip install ragas datasets\n",
    "from datasets import Dataset\n",
    "\n",
    "# 2. Structure the data into a Hugging Face Dataset object\n",
    "data_samples = {\n",
    "    'question': questions,\n",
    "    'answer': generated_answers,\n",
    "    'contexts': retrieved_documents,\n",
    "    'ground_truth': ground_truth_answers\n",
    "}\n",
    "\n",
    "dataset = Dataset.from_dict(data_samples)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0efe32a8",
   "metadata": {},
   "source": [
    "Now, we can define our metrics and run the evaluation. `RAGAS` offers several powerful, RAG-specific metrics out of the box."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "id": "84cd6bb1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                                             question  ... answer_correctness\n",
      "0  What is the name of the three-headed dog guard...    ...           1.000000\n",
      "1          Who gave Harry Potter his first broomstick?  ...           0.954321\n",
      "2  Which house did the Sorting Hat initially cons...    ...           1.000000\n"
     ]
    }
   ],
   "source": [
    "from ragas import evaluate\n",
    "from ragas.metrics import (\n",
    "    faithfulness,\n",
    "    answer_relevancy,\n",
    "    context_recall,\n",
    "    answer_correctness,\n",
    ")\n",
    "\n",
    "# 3. Define the metrics we want to use for evaluation\n",
    "metrics = [\n",
    "    faithfulness,       # How factually consistent is the answer with the context? (Prevents hallucination)\n",
    "    answer_relevancy,   # How relevant is the answer to the question?\n",
    "    context_recall,     # Did we retrieve all the necessary context to answer the question?\n",
    "    answer_correctness, # How accurate is the answer compared to the ground truth?\n",
    "]\n",
    "\n",
    "# 4. Run the evaluation\n",
    "result = evaluate(\n",
    "    dataset=dataset, \n",
    "    metrics=metrics\n",
    ")\n",
    "\n",
    "# 5. Display the results in a clean table format\n",
    "results_df = result.to_pandas()\n",
    "print(results_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a8a708cf",
   "metadata": {},
   "source": [
    "We can see that our system is highly faithful and retrieves relevant context well (`faithfulness` and `context_recall` are perfect). The answers are also highly relevant and correct, with only minor deviations.\n",
    "\n",
    "`RAGAS` makes it incredibly easy to run this kind of comprehensive, end-to-end evaluation, giving us the data we need to confidently deploy and improve our RAG applications."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e97d1747",
   "metadata": {},
   "source": [
    "<a id='part8'></a>\n",
    "# Summarizing Everything\n",
    "\n",
    "So, let’s sum up what we have done so far on our way to build a production-ready RAG system."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "67f289b3",
   "metadata": {},
   "source": [
    "- In **Part 1**, we built a foundational RAG system from the ground up, covering the three core components: **Indexing** our data, **Retrieving** relevant context, and **Generating** a final answer.\n",
    "- In **Part 2**, we moved to **Advanced Query Transformations**, using techniques like RAG-Fusion, Decomposition, and HyDE to rewrite and expand user questions for far more accurate retrieval.\n",
    "- In **Part 3**, we turned our pipeline into an intelligent switchboard, adding **Routing** to direct queries to the correct data source and **Query Structuring** to leverage powerful metadata filters.\n",
    "- In **Part 4**, we focused on **Advanced Indexing**, exploring strategies like Multi-Representation Indexing and token-level ColBERT to create a smarter and more efficient knowledge library.\n",
    "- In **Part 5**, we polished the final output with **Advanced Retrieval** techniques like re-ranking to prioritize the best context and introduced agentic, self-correcting concepts like CRAG and Self-RAG.\n",
    "- Finally, in **Parts 6 and 7**, we tackled the crucial step of **Evaluation**. We learned how to measure our system’s performance with key metrics like faithfulness and correctness, both by building evaluators from scratch and by using powerful frameworks like deepeval, grouse, and RAGAS.\n",
    "\n",
 In case you enjoy this blog">
    "> If you enjoyed this blog, feel free to **[follow me on Medium](https://medium.com/@fareedkhandev)**, which is the only place I write."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
